Section 0.0


EXECUTIVE  SUMMARY 


The quantity and importance of embedded software (SW) in modern weapon
systems have increased dramatically over the last decade and, with the
ever-increasing complexity of this SW, it has become necessary to consider the
impact of SW faults and maintenance actions on system effectiveness. Whereas in
systems of several generations past the SW contribution to system unavailability
could reasonably be neglected, the effects of SW faults on modern weapon systems
are known to be substantial. While the state of the art in hardware (HW)
reliability modeling is now well founded, and SW reliability modeling is
approaching a state where SW reliability requirements can be specified and
designed toward, there is no unified technique for modeling the
reliability/maintainability scenario of a system exhibiting both HW and SW
faults and repairs. The purpose of this study is to develop such a unified
technique so that combined HW/SW reliability measures may be specified and
analyzed, leading to a reliability design which is adequate to meet mission
requirements. This unified technique consists of a combined HW/SW reliability
model, i.e., a mathematical model which probabilistically describes, at any
point in time, the state (in terms of reliability) of a system exhibiting both
HW and SW faults and repairs. What follows is a rudimentary description of the
concepts and theory behind this model.

There  are  two  basic  components  in  the  combined  HW/SW  reliability  model. 

The most fundamental of these is the SW fault correction process, which
represents the number of faults remaining in the SW (and hence defines the SW
failure rate as a function of the proportion of faults remaining) at any point
in time. Letting X(t) represent the number of SW faults at time t, we assume
that X(t) changes as faults occur and are corrected (the implicit assumption is
that X(t) takes the form of a time-homogeneous Markov process). Figure 0.0-1
shows how X(t), and hence the SW failure rate, changes as fault corrections take
place at the random points in time t_1, t_2, .... During each interval between
fault removals, the SW failure rate remains constant, taking a value
proportional to the number of faults remaining in the SW. Notice that in
Figure 0.0-1, any number of faults can be corrected (or even inadvertently
introduced) over any interval of time. The exact nature and characteristics of
X(t) are determined by the particular software reliability model chosen for
X(t). This is described in detail in the body of the report.
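
As an illustration of a fault count X(t) with a piecewise-constant SW failure
rate, the following minimal sketch simulates one sample path. It assumes a
Jelinski-Moranda-style special case that is not required by the general model:
exactly one fault is removed at each jump, and the jump rate is phi times the
number of faults remaining. The names simulate_fault_count, failure_rate_at,
and phi are hypothetical and chosen for this sketch only.

```python
import random

def simulate_fault_count(n_faults, phi, rng):
    """Simulate X(t) as a pure-death process (an assumed special case).

    With x faults remaining, the holding time is exponential with rate phi * x,
    and each jump removes exactly one fault.
    Returns a list of (jump_time, faults_remaining) pairs starting at t = 0.
    """
    t, x = 0.0, n_faults
    path = [(t, x)]
    while x > 0:
        t += rng.expovariate(phi * x)  # exponential holding time at rate phi*x
        x -= 1
        path.append((t, x))
    return path

def failure_rate_at(path, phi, t):
    """SW failure rate at time t: phi times the faults remaining at time t."""
    remaining = path[0][1]
    for jump_time, x in path:
        if jump_time <= t:
            remaining = x  # the path is right-continuous at each jump
        else:
            break
    return phi * remaining

# Hypothetical parameter values, for illustration only.
path = simulate_fault_count(n_faults=5, phi=0.02, rng=random.Random(1))
for jump_time, x in path:
    print(f"t = {jump_time:8.2f}  faults remaining = {x}")
```

Between consecutive jump times the value returned by failure_rate_at is
constant, which is the property the combined model relies on.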

The second component of the combined HW/SW reliability model is the process
describing the operating condition of the system, the "system state" (in terms
of which of its components are operating or not operating due to system
malfunctions and restorations caused by both software and hardware), at any
point in time within a given interval defined by X(t).

We represent the possible system states by the set {0, 1, 2, ..., J}, where
we arbitrarily assign the full-up state to "0" (implying that none of the system
components is in an inoperative state). The remaining states, namely 1, 2, ...,
J, can be used to represent any combination of HW/SW operational, degraded, or
inoperative states. We represent the state of the system at time t by Y(t), and
it turns out that (conditioned on t being in a time interval between successive
changes in the SW failure rate) Y(t) evolves according to a time-homogeneous
Markov process. Figure 0.0-2 shows the relationship between the two components.
Referring to Figure 0.0-2, beginning at time 0, Y(t) evolves according to a
Markov process (determined by the fixed values of the failure and repair rates
and the initial value of the SW failure rate) until time t_1, when the SW
failure rate jumps to a new value. From time t_1 until the next jump in SW
failure rate, Y(t) evolves according to a different Markov process, different
because the SW failure rate changed at time t_1. The process Y(t) continues in
this fashion.

The concept of availability of a combined HW/SW system may be defined
exactly as for a system possessing only HW. Namely, the availability of the
system at time t is the probability that the system is capable of performing all
necessary mission tasks at time t. During periods of time when the SW failure
rate is constant, portions of the system (including SW) may incur malfunctions
and repairs or restorations, and hence the system state Y(t) may transition from
one state to another. Figure 0.0-3 provides a simple three-state example of a
transition diagram when there are j faults remaining in the SW. Naturally,
when the SW failure rate is high, the SW will fail more frequently. On the other
hand, when enough SW faults have been removed to effect a substantial reduction
in the SW failure rate, the system failures due to SW become less frequent
and can ultimately be eliminated if the SW can be totally debugged. This
changing nature of the SW failure rate is responsible for an "availability
undershoot", which is illustrated in Figure 0.0-4. When a system possesses no
SW, and provided that the system state is full-up at time 0, its availability
curve begins at the value 1 at time 0 and decreases (as t increases),
approaching the steady-state value. For this reason, steady-state availability
is often specified as a requirement. For example, for a system with only one HW
unit (and no SW) the "steady-state availability" is MTBF/(MTBF+MTTR), a formula
well known to reliability engineers. For this system, the time-dependent
availability always exceeds the "steady-state" value, MTBF/(MTBF+MTTR), and so
specifying "steady-state availability" as a requirement is really the same as
specifying the "worst case" value. As seen in Figure 0.0-4, however, the
presence of SW can cause the availability to dip below the "steady-state" value
and then approach steady state from below. Notice in Figure 0.0-4 that each
curve possesses the same "steady-state" value, and pay particular attention to
the impact of the SW parameters (these parameters are explained in the report)
on the size of the undershoot. The implication of this is clear: for systems
with substantial embedded SW, "steady-state availability" is misleading. A
better measure is minimum availability.

NOTE: NOTICE THAT THE SW FAILURE RATE IS PROPORTIONAL TO THE NUMBER OF ERRORS
REMAINING IN THE SW. THE TRANSITION DIAGRAM IS ACTUALLY A "CONDITIONAL"
TRANSITION DIAGRAM SINCE X(t) = k, k ≠ j, WOULD RESULT IN A DIFFERENT SW
FAILURE RATE, NAMELY φ_k.

Figure 0.0-3. Example of a System Having Both HW and SW Failure States. With
X(t) = j SW faults present in the system, the (constant) SW failure rate is
φ_j, and the HW failure rate is λ_HW. The respective repair rates are μ_HW and
μ_SW.

Figure 0.0-4. A(t) Versus Time
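
The single-unit, HW-only case mentioned above can be checked numerically. The
sketch below assumes exponentially distributed times to failure and to repair
(constant rates 1/MTBF and 1/MTTR) and a unit that is up at time 0; under those
assumptions the standard two-state Markov result is
A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu)*t). The function names and the
numeric values are hypothetical.

```python
import math

def steady_state_availability(mtbf, mttr):
    """Steady-state availability MTBF / (MTBF + MTTR) of a single HW unit."""
    return mtbf / (mtbf + mttr)

def availability(t, mtbf, mttr):
    """Time-dependent availability of one repairable unit that is up at t = 0.

    Assumes constant failure rate lam = 1/MTBF and repair rate mu = 1/MTTR.
    """
    lam, mu = 1.0 / mtbf, 1.0 / mttr
    steady = mu / (lam + mu)  # algebraically equal to MTBF / (MTBF + MTTR)
    return steady + (1.0 - steady) * math.exp(-(lam + mu) * t)

# Hypothetical values: MTBF = 500 h, MTTR = 2 h.
for t in (0.0, 1.0, 5.0, 50.0):
    print(t, availability(t, 500.0, 2.0), steady_state_availability(500.0, 2.0))
```

With no SW, A(t) starts at 1 and decays monotonically to the steady-state value,
so it never falls below it; the undershoot discussed above arises only when the
failure rate itself changes over time.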

Because of the HW/SW interfaces and interactions in a large system, HW/SW
reliability modeling requires that:

a. HW/SW duty cycles be defined relative to each subsystem (i.e., a
relationship between each SW module use time and subsystem hardware
operating time).

b. The SW be partitioned into groups of modules which interact with
specific subsystem HW items.

Thus, the combined HW/SW model must be applied to each constituent subsystem
consisting of HW along with its specific SW. Figure 0.0-5 shows a typical
reliability block diagram for a Command, Control, and Communications system.
Table 0.0-1 gives an example of partitioning of a hypothetical Air Surveillance
System's SW based on percent utilization. These data are used to determine the
values of the SW parameters to be used in the combined HW/SW model (an
explanation of how to use these data is given in the body of the report). Having
applied the combined HW/SW reliability model to each of K constituent
subsystems, the overall system reliability measures may be computed. For
example, if A_i(t) is the availability of subsystem i at time t as determined by
applying the combined HW/SW model to subsystem i, then the system availability
at time t is simply the product A_1(t) x A_2(t) x ... x A_K(t) of the subsystem
availabilities. The application of availability and other figures of merit to
complex systems with embedded SW is discussed in detail.
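
The product rule for system availability can be sketched as follows. The
function name system_availability and the sample values are hypothetical; the
product form is taken from the text and corresponds to treating the K subsystems
as independent and all required for system operation.

```python
import math  # math.prod requires Python 3.8 or later

def system_availability(subsystem_availabilities):
    """Product A_1(t) x A_2(t) x ... x A_K(t) of subsystem availabilities at one time t."""
    values = list(subsystem_availabilities)
    for a in values:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"availability out of range: {a}")
    return math.prod(values)

# Hypothetical subsystem availabilities evaluated at a common time t.
print(system_availability([0.99, 0.98, 0.995]))
```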

The concepts discussed above, along with the reliability tradeoff methodology
between HW and SW, are treated in the body of the report.

Figure 0.0-5. Reliability Block Diagram for a System


Table 0.0-1. Partitioning of an Air Surveillance System SW Based On


SUMMARY OF STUDY RESULTS


The need for operational versatility in modern weapon systems has dramatically
increased the importance of the embedded software (SW) used in these
systems. This is particularly true for multipurpose systems which have a large
number of varied human interfaces, such as command, control and communications
(C3) systems. With this high SW usage, increased emphasis has been placed on
SW reliability. A literature search conducted early in the study revealed that
many efforts have been undertaken in recent years to develop techniques for
quantifying SW reliability. As a result, SW measurement and modeling methodology
is fast approaching a point where it is on a par with hardware (HW) reliability
methodology. A large number of SW reliability models have been developed and
have, with varying degrees of success, been validated against error data.
Accordingly, Hughes is currently under contract to evaluate ten of the more
promising SW models using error data collected from an on-going C3 system*.

Therefore, with the state of SW reliability methodology approaching that of
HW, the next logical step is to combine the two disciplines into a common
reliability methodology, which was the purpose of this study. Specifically, the
objectives of this study were to 1) develop the necessary technical foundation
for combining HW and SW reliability into common figures-of-merit (FOMs) which
have conventional (i.e., consistent with HW) reliability interpretations,
2) develop models/procedures which apply to HW versus SW tradeoffs with respect
to combined HW/SW reliability measures, and 3) define specific FOMs and tradeoff
models/procedures which are applicable to C3 systems.

Aside from the usual failure and repair processes that take place under
purely HW considerations, there are three additional types of random phenomena
that must be considered in deriving a combined HW/SW reliability model. These
are: 1) the SW failure process, 2) the SW "repair" process (i.e., a remedial
action that restores the system to an operational state without correcting the
SW fault), and 3) the SW fault correction process which, if successful, affects
the SW failure process. In this investigation, a general methodology for
combining traditional HW reliability models with SW reliability models has been
developed around the theory of Markov processes. Appendix B provides a brief
theoretical background in Markov processes as they apply to the HW and SW
failure and repair phenomena discussed in this report.

Figure 1.0-1 provides an overview of the material presented in this report.
The general flow in the main body of the material is oriented toward the
reliability practitioner as much as possible, with the necessary background
theory and most of the more complex developments provided in the Appendices. The
combined HW/SW reliability model and associated reliability measures are
developed in generality (Section 3.0) without specifying the nature of the SW
processes. To be a useful working model, however, the SW as well as the HW
processes must be well defined. This is done in Section 4.0, where the general
HW/SW model is applied to some simple reliability configurations (or constructs)
using SW models selected from the literature. Specifically, the SW reliability
theory of Jelinski-Moranda (1972) and Goel-Okumoto (1978), and standard HW
reliability theory (e.g., as described in Barlow and Proschan (1965, 1975) and
Kozlov and Ushakov (1970)), were incorporated in the general model to produce
a working HW/SW reliability model. The details of going from the general model
to this working model are provided in Appendix D. A similar development would
be required for another choice of HW and SW models. Computational aspects
related to the HW/SW model are presented, and an experimental computer program
for the model is described and documented.

The extension of this HW/SW model to more complex reliability constructs is
given in Section 5.0 with an example application to a C3 system. A system model
is described based on operational-mission tasks using the simple reliability
constructs of Section 3.0 as "building blocks". An attempt is also made at
unifying failure and repair concepts based on the previously selected HW and SW
models.

Finally,  a  HW/SW  tradeoff  procedure  is  established  in  Section  6  using  various 
interpretations  of  the  combined  HW/SW  availability  measure.  Isometric  availability 
curves  are  used  to  specify  feasible  ranges  of  HW  versus  SW  complexity  in  terms 
of  failure  rate.  Example  applications  to  specific  HW/SW  reliability  constructs  are 
also  provided. 

Figure 1.0-1. General Arrangement of Sections Within the Technical Report


Section 2.0


RESULTS  OF  LITERATURE  SEARCH 


Eight separate literature searches were conducted through various agencies in
order to identify and evaluate sources of information pertaining to combined
hardware/software reliability models. These searches included all journals,
reports, and technical information on software reliability and combined
software/hardware reliability published since 1975.

Literally thousands of references were reported in these searches, but only one
precedent was discovered for mathematical modeling of systems experiencing both
hardware and software failures (Costes et al., 1978). Many useful references to
software reliability mathematical models were found along with the well-known
work in hardware reliability models (e.g., Barlow & Proschan 1965, 1975; and
Kozlov & Ushakov 1970). A combined list of references and bibliography
containing the highlights of the literature search is presented in Section 8.0.

There are several controlling ideas in past attempts at modeling software
reliability. Many authors agree that software errors manifest themselves
according to a type of Poisson process. For example, Jelinski & Moranda (1971,
1972, 1975, 1976) model the software failure process by assuming a constant
failure rate between bug removals. A variation on this is the family of models
of Schick and Wolverton (1973, 1978), which assume a piecewise continuous
failure rate. Shooman (1972, 1973, 1975, 1978) proposed a model conceptually
similar to that of Jelinski & Moranda, with a software failure rate proportional
to the number of faults remaining in the software. A continuous analog to these
models is the non-homogeneous Poisson model of Goel & Okumoto (1980), whose
error correction rate is a monotone decreasing continuous function of time.
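
To make the contrast between the stepwise and continuous views concrete, the
sketch below evaluates the commonly cited forms of the two model families: a
Jelinski-Moranda-style failure rate phi * (N - faults removed), and a
Goel-Okumoto-style non-homogeneous Poisson intensity a * b * exp(-b * t) with
mean value function a * (1 - exp(-b * t)). These textbook forms and all
parameter values are assumptions of this sketch, not formulas quoted from the
surveyed papers.

```python
import math

def jm_failure_rate(n_initial, phi, faults_removed):
    """Stepwise failure rate: constant between removals, proportional to faults left."""
    if not 0 <= faults_removed <= n_initial:
        raise ValueError("faults_removed must lie between 0 and n_initial")
    return phi * (n_initial - faults_removed)

def go_intensity(a, b, t):
    """NHPP failure intensity a*b*exp(-b*t): monotone decreasing and continuous in t."""
    return a * b * math.exp(-b * t)

def go_expected_failures(a, b, t):
    """NHPP mean value function a*(1 - exp(-b*t)): expected failures observed by time t."""
    return a * (1.0 - math.exp(-b * t))

# Hypothetical parameters: 100 initial faults, per-fault rate 0.002; a = 100, b = 0.05.
print([jm_failure_rate(100, 0.002, i) for i in (0, 25, 50, 100)])
print([round(go_intensity(100.0, 0.05, t), 4) for t in (0.0, 10.0, 50.0)])
```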

There are many apparent controversies involving past modeling attempts. One
such controversy stems from the time scale involved in the models. Musa (1975)
argues that time should be measured in execution time, while other authors often
do not make such a specification, implying universal applicability with respect
to time scales. It was concluded in Schafer et al. (1979) that the failure of
many popular software reliability models to adequately fit the available
software failure data was due primarily to improper control of the software
testing intensity with respect to time.

Another controversy involves the applicability of models with respect to
software failure definitions. There is no standard failure definition for
software with regard to operational mission, although it is implied in the
literature that the models are applicable to software failures under any
definition. This cannot be so, since there are important structural differences
between software "bugs" (e.g., documentation errors vs. logic errors).

Finally,  there  are  certain  features  which  are  thought  by  most  authors  to  be  very 
important  for  a  software  reliability  model  to  possess.  The  model  must,  of  course, 
be  mathematically  tractable  as  were  most  models  revealed  in  the  literature  search. 
Although  no  one  model  was  especially  simple  to  apply,  the  procedures  for  the  most 
part  are  easily  implemented  on  a  digital  computer. 

Other features which a software reliability model should have are the ability
to account for imperfect debugging (Goel & Okumoto, 1978, 1980), the ability to
account for the phenomenon of recovering the working system without fixing the
software fault (Costes et al., 1978), the ability to account for software
maintenance (Trivedi & Shooman 1975, Costes et al. 1978), and a mechanism by
which faults generated as a direct result of erroneous input data (as opposed to
program bugs) can be handled (Littlewood 1979).

As mentioned earlier, the literature search turned up many references on
hardware reliability modeling, but most are derived from the premises set forth
in the classical works of Barlow & Proschan (1965, 1975), Feller (1957, 1966),
Kozlov & Ushakov (1970), Lloyd & Lipow (1962), and Mann et al. (1974). By and
large, these authors treat hardware reliability-maintainability models from the
standpoint of continuous-time Markov processes. Implicit in their models is the
assumption of constant failure and repair rates, the latter of which is not
unreasonable in view of modern fault isolation capabilities.

Costes et al. (1978) studied the combined hardware/software reliability problem
from a semi-Markov point of view. Although their approach is straightforward
for simple system configurations, it becomes intractable for a complex system,
with only steady-state results being available. Indeed, one cannot even write
down a complete system state diagram easily for a moderately complex system
structure, due to the exhaustive manner in which system states are defined.
Their semi-Markov approach shows definite promise for combined
hardware/software reliability models but places too much importance on
extraneous system states.

THEORETICAL FOUNDATIONS OF COMBINED HW-SW RELIABILITY MODELS

3.1  INTRODUCTION 

This section includes a theoretical discussion of the approach to combining
HW and SW reliability models. A short section providing background information
on purely discontinuous Markov processes is included in Appendix B, since they
are the foundation of hardware reliability models (cf. Kozlov & Ushakov, 1970)
and play an integral part in the combined model. The combined HW/SW reliability
model and its associated reliability measures are first derived for general SW
failure and repair processes. Section 4.0 then provides a derivation from the
general model based on some simplifying assumptions concerning the nature
of the SW process. Example applications using simple series systems and
redundant systems with single unit replications will be used to examine the
effects on reliability of combining the HW and SW processes. In Section 5.0, the
methodology will be extended to more complex systems involving series-parallel
constructs. Specifically, the methodology will be applied to a typical command,
control and communication system.

3.2 THE GENERAL COMBINED HARDWARE/SOFTWARE RELIABILITY MODEL

The general approach of combining HW and SW reliability models is based on
the widely accepted propositions that: (1) the state of a hardware system is
adequately described by a time-stationary Markov process (cf. e.g. Kozlov &
Ushakov 1970) and (2) SW possesses a constant failure rate between SW fault
corrections (Jelinski & Moranda 1972, Lipow 1974, Lloyd & Lipow 1978, Shooman
1972, Trivedi & Shooman 1975). It should be pointed out that (1) requires the
assumption that the HW repair times of maintained systems are exponentially
distributed, which is not unreasonable in view of modern fault
isolation/detection capabilities.

In combining (1) and (2) above into a HW/SW reliability model it is important
to first outline the features that such a model should have. Obviously, the
model should account for both HW and SW failures occurring in time and the
repair thereof. The model should allow for the recovery of the SW operating
system after the manifestation of a fault without necessarily removing the fault
(i.e., SW startover/switchover capability). The model should allow for, but not
necessarily be based on, an independent SW support facility which, knowing the
SW contains faults, works concurrently during system operations to uncover and
remove such faults before they cause substantial system down time. Such
support facilities are common on systems possessing complex SW, such as command,
control and communications (C3) systems. Finally, when an attempt is made at
correcting a SW fault, the model should allow for the possibility that this
attempt is unsuccessful.


This fault correction process is characterized by a purely discontinuous
time-stationary Markov process {X(t), t ≥ 0} taking values in the set
{0, 1, ..., N}, which represents the number of faults remaining in the SW. Let
0 = t_0 < t_1 < ... < t_k < ... denote the random times at which the process
X(t) changes values. By construction, the process X(t) is a right-continuous
step function. So, for t in the interval [t_k, t_{k+1}), X(t) is constant,
taking the value X(t_k). If the process is eventually absorbed at 0 (i.e., no SW
faults remaining in the system), then the sequence of jump times will be finite,
say t_0 < t_1 < ... < t_M, for some (random) integer M.
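
The right-continuous step-function property can be sketched as a lookup over one
realized path. The function name step_value and the sample jump times are
hypothetical; the only assumption is that the jump times are sorted and begin at
t_0 = 0.

```python
from bisect import bisect_right

def step_value(jump_times, values, t):
    """Value of a right-continuous step function at time t.

    jump_times: sorted jump times t_0 = 0 < t_1 < ... < t_M.
    values[k]:  value held on the interval [t_k, t_{k+1}).
    """
    if t < jump_times[0]:
        raise ValueError("t precedes the first jump time")
    k = bisect_right(jump_times, t) - 1  # index of the last jump time <= t
    return values[k]

# Hypothetical path: 3 faults until t = 2.5, then 2 until t = 4.0, then 1.
jump_times = [0.0, 2.5, 4.0]
faults = [3, 2, 1]
print(step_value(jump_times, faults, 2.49))  # still on the first interval
print(step_value(jump_times, faults, 2.5))   # new value applies at the jump itself
```

Using bisect_right (rather than bisect_left) is what makes the function take its
new value at the jump time, matching the half-open intervals [t_k, t_{k+1}).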

Now let the possible HW/SW system states be represented by 0, 1, ..., J,
where "0" represents the full-up state and 1 through J represent various HW
and SW degraded or failed states, and let Y(t) be the state of the system at
time t ≥ 0. Because SW possesses a constant failure rate as long as X(t) is
constant (i.e., X(t) is constant in [t_k, t_{k+1})), it is reasonable to model
Y(t), the system state at time t, as a time-stationary Markov process during
such periods of time. Figure 3.2-1 gives an example of a system having both HW
and SW failure states, where A_j represents the (J+1)x(J+1) transition rate
matrix (or infinitesimal matrix, see B.1-8) corresponding to the system state
transition diagram when X(t) = j, i.e., when there are j bugs in the SW.

Suppose, therefore, that the state of the system at t_k is such that
X(t_k) = j, Y(t_k) = ℓ. Then the conditional probability that the HW/SW system
is in state m at time t, given the state of the system at time t_k, is:

\[
P\{Y(t) = m \mid t_k \le t < t_{k+1},\ X(t_k) = j,\ Y(t_k) = \ell,\ t_k = u,\ t_{k+1} = v\}
  = p_j(t-u, \ell, m) \tag{3.2-1}
\]

where p_j is the solution to the system of differential equations (B.1-7) with
A replaced by A_j, the initial condition is given by p_j(0, ℓ, ℓ) = 1, and u
and v are the values of the random times t_k and t_{k+1}. The unconditional
distribution of Y(t) can now be written as:

\[
P\{Y(t) = m\} = \sum_{\ell=0}^{J} \sum_{j=0}^{N} \sum_{k \ge 0} \int_{u < v}
  p_j(t-u, \ell, m)\,
  P\{t_k \le t < t_{k+1},\ X(t_k) = j,\ Y(t_k) = \ell,\
     t_k \in [u, u+du),\ t_{k+1} \in [v, v+dv)\} \tag{3.2-2}
\]

A typical sample path of Y(t), t ≥ 0, can thus be described as follows. At
time t = 0, Y begins at state 0 (i.e., fully operational) and X(0) = N (i.e.,
N faults are present in the SW). From time 0 until t_1, Y(t) evolves according
to a time-homogeneous Markov process with infinitesimal matrix A_N and initial
value 0. At time t_1, X jumps from its initial value N to some other state
k ∈ {0, 1, ..., N-1}. Then, for t_1 ≤ t < t_2, Y(t) evolves according to a new
Markov process with infinitesimal matrix A_k and initial value Y(t_1). The
process continues in this fashion (see Figure 3.2-2).
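
The piecewise evolution described above can be sketched numerically for one
fixed realization of X(t). The code below computes the distribution of Y(t)
conditional on a given path of X (in the spirit of (3.2-1)); it does not perform
the averaging over paths required by (3.2-2). Several things are assumptions of
this sketch: the three-state layout (0 = up, 1 = SW down, 2 = HW down), the
sign convention of the matrix, the forward equations dp/dt = p A, the SW failure
rate phi times the faults remaining, the Runge-Kutta integrator, and all
numeric values.

```python
def generator(phi_j, lam_hw, mu_sw, mu_hw):
    """Assumed infinitesimal matrix A_j for states 0 (up), 1 (SW down), 2 (HW down)."""
    return [
        [-(phi_j + lam_hw), phi_j, lam_hw],
        [mu_sw, -mu_sw, 0.0],
        [mu_hw, 0.0, -mu_hw],
    ]

def derivative(p, a):
    """Right-hand side of the forward equations dp/dt = p A (p is a row vector)."""
    n = len(p)
    return [sum(p[i] * a[i][m] for i in range(n)) for m in range(n)]

def rk4_step(p, a, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = derivative(p, a)
    k2 = derivative([x + 0.5 * h * k for x, k in zip(p, k1)], a)
    k3 = derivative([x + 0.5 * h * k for x, k in zip(p, k2)], a)
    k4 = derivative([x + h * k for x, k in zip(p, k3)], a)
    return [x + h / 6.0 * (c1 + 2.0 * c2 + 2.0 * c3 + c4)
            for x, c1, c2, c3, c4 in zip(p, k1, k2, k3, k4)]

def propagate(p, a, duration, h=0.05):
    """Advance the state distribution p over `duration` under a fixed matrix a."""
    steps = max(1, round(duration / h))
    h = duration / steps
    for _ in range(steps):
        p = rk4_step(p, a, h)
    return p

def conditional_distribution(t, jump_times, fault_counts, phi, lam_hw, mu_sw, mu_hw):
    """Distribution of Y(t) given one path of X, starting from Y(0) = 0.

    fault_counts[k] faults remain on [jump_times[k], jump_times[k+1]).
    """
    p = [1.0, 0.0, 0.0]
    for k, start in enumerate(jump_times):
        if start >= t:
            break
        end = jump_times[k + 1] if k + 1 < len(jump_times) else t
        end = min(end, t)
        a = generator(phi * fault_counts[k], lam_hw, mu_sw, mu_hw)
        p = propagate(p, a, end - start)  # matrix switches at each jump of X
    return p

# Hypothetical rates and fault-correction path.
rates = dict(phi=0.01, lam_hw=0.001, mu_sw=2.0, mu_hw=0.5)
early = conditional_distribution(40.0, [0.0, 50.0, 120.0], [3, 2, 1], **rates)
late = conditional_distribution(200.0, [0.0, 50.0, 120.0], [3, 2, 1], **rates)
print("P{Y=0} at t=40: ", early[0])
print("P{Y=0} at t=200:", late[0])
```

On this path the probability of the full-up state is lower while three faults
remain than after two have been corrected, which illustrates why Y(t) cannot be
time-stationary in general.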

(a) TRANSITION DIAGRAM OF A SYSTEM HAVING BOTH HW AND SW FAILURE STATES.

(b) INFINITESIMAL MATRIX FOR HW/SW SYSTEM

NOTE: THE TRANSITION DIAGRAM (a) IS ACTUALLY A "CONDITIONAL" TRANSITION DIAGRAM
SINCE X(t) = k, k ≠ j, WOULD RESULT IN A DIFFERENT SW FAILURE RATE, NAMELY φ_k.
SIMILARLY, THE INFINITESIMAL MATRIX (b) IS ALSO CONDITIONED ON THE NUMBER
OF SW FAULTS IN THE SYSTEM.

Figure 3.2-1. Example of a System Having Both HW and SW Failure States. With
X(t) = j SW faults present in the system, the (constant) SW failure rate is
φ_j, and the HW failure rate is λ_HW. The respective repair rates are μ_HW and
μ_SW.

Figure 3.2-2. Sample Paths of Y(t), t ≥ 0, and X(t), t ≥ 0


The process Y(t) is not a time-stationary Markov process in general,
although it is, in a sense, conditionally a Markov process. However, it should
not be expected that the state of a system possessing both hardware and software
faults be a time-stationary Markov process, because SW reliability "improves"
as faults are corrected. Thus a naive approach to modeling based on a strictly
constant SW failure rate is not realistic.

Because of (3.2-1) and (3.2-2), and in view of the typical sample path
behavior shown in Figure 3.2-2, the joint process (Y(t), X(t)) can be viewed
as a Markov process with a random environment (the random environment being
the Y(t)-parameter values depending on X(t)). Such models are not new. For
related processes, refer to Athreya & Karlin (1971, 1971A), Kaplan (1973),
Purdue (1974), Smith (1968), Smith & Wilkinson (1969, 1971), Solomon (1975),
Torrez (1978, 1979), and Cogburn & Torrez (1981).

3.3 RELIABILITY MEASURES

Equation (3.2-2) can be used to define the usual measures of reliability.
The system availability at time t is defined as the probability that the system
is capable of performing the mission tasks at time t. Hence, the availability,
denoted by A(t), is

\[
A(t) = \sum_{m \in OP} P\{Y(t) = m\} \tag{3.3-1}
\]

where OP is the set of successful operating states (i.e., OP = {m : m is an
operating state}) and the summation is over all such operating states.

The system mean time to failure (MTTF) starting from an operational state
is obtained by first deriving the distribution of the time to go from a given
operational state to the first visit to a failed state. For this purpose, the
general model (3.2-2) is derived under the condition that the class of failed
states for Y(t) be an absorbing class; i.e., the failed states in the transition
diagrams from which the p_j(t-u, ℓ, m) are derived in (3.2-2) have arrows
pointing to them but none pointing out of them leading to operational states.
For example, the transition diagram of Figure 3.2-1 would have no return arrows
from the failed states 1 and 2. This is equivalent to zeroing out the repair
rates leading from failed states to operational states. Then, starting in an
operational state, say state i, the time T_{iF} until the system fails for the
first time has distribution defined by

\[
P\{T_{iF} > t \mid Y(0) = i\} = \sum_{m \in OP} P\{Y(t) = m \mid Y(0) = i\}. \tag{3.3-2}
\]

Then the MTTF starting from state i is the integrated reliability function
defined by

\[
\mathrm{MTTF}_i = \int_0^{\infty} \sum_{m \in OP} P\{Y(t) = m \mid Y(0) = i\}\, dt. \tag{3.3-3}
\]

Under the same conditions governing (3.3-2) and (3.3-3), the probability of
failure-free operation starting in operational state i and of duration t is
given by

\[
P(t) = P\{T_{iF} > t \mid Y(0) = i\}. \tag{3.3-4}
\]


Another measure, the non-stationary reliability coefficient, is defined as
the probability that a system is operational at a moment t > 0 and then
operates failure-free up to the moment $t + t_0$, $t_0 > 0$ (Kozlov & Ushakov, 1970,
p. 28). Denoting this quantity by $R(t, t_0)$, then

$$R(t, t_0) = P\{Y(t) \in OP,\ Y(t + t_0) \in OP\}. \tag{3.3-5}$$

Since the process Y(t) is generally neither Markov nor time-stationary, (3.3-5)
is difficult to compute. However, under certain conditions Y(t) will behave
asymptotically as a time-stationary Markov process (see Appendix B) so that
$R(t, t_0)$ can be approximated by $R(\infty, t_0)$ to yield a "steady-state" reliability
coefficient which can be shown to be equal to

$$R(\infty, t_0) = \sum_{m \in OP} P_m(t_0)\,\pi(m) \tag{3.3-6}$$

where the quantities

$$\pi(m) = \lim_{t \to \infty} P\{Y(t) = m \mid Y(0) = k\}$$

are assumed to exist and be independent of k, and $P_m$ (see (3.3-4)) is computed
under "steady-state" conditions.
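The measures (3.3-1) through (3.3-4) can be illustrated numerically for the
special case in which the generator of Y(t) is constant, i.e., conditioned on a
fixed number of SW faults. The following Python sketch makes that simplifying
assumption (the general process Y(t) is not Markov); the function names and the
use of uniformization are illustrative choices, not part of the report.

```python
import math

def transient_probs(A, start, t, tol=1e-12):
    """P{Y(t) = m | Y(0) = start} for a constant generator A (uniformization)."""
    n = len(A)
    q = max(-A[x][x] for x in range(n)) or 1.0    # uniformization rate
    # One-step matrix of the uniformized discrete-time chain, I + A/q.
    P = [[(1.0 if x == y else 0.0) + A[x][y] / q for y in range(n)]
         for x in range(n)]
    v = [1.0 if m == start else 0.0 for m in range(n)]
    w = math.exp(-q * t)                          # Poisson weight for k = 0
    out, acc, k = [w * vm for vm in v], w, 0
    while 1.0 - acc > tol and k < 100000:
        k += 1
        v = [sum(v[x] * P[x][y] for x in range(n)) for y in range(n)]
        w *= q * t / k
        acc += w
        out = [o + w * vm for o, vm in zip(out, v)]
    return out

def availability(A, start, t, op):
    """A(t) as in (3.3-1): sum of occupancy probabilities over the set OP."""
    p = transient_probs(A, start, t)
    return sum(p[m] for m in op)

def absorbing(A, op):
    """Copy of A in which every failed state is absorbing (its row is zeroed)."""
    return [list(row) if x in op else [0.0] * len(A) for x, row in enumerate(A)]

def failure_free(A, start, t, op):
    """P_i(t) as in (3.3-2) and (3.3-4), using the absorbing version of A."""
    return availability(absorbing(A, op), start, t, op)

def mttf(A, start, op, horizon, steps=2000):
    """MTTF_i as in (3.3-3), truncated at `horizon` (trapezoidal rule)."""
    h = horizon / steps
    vals = [failure_free(A, start, k * h, op) for k in range(steps + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

# Hypothetical two-state chain: state 0 operational, state 1 failed.
lam, mu = 0.5, 2.0
A = [[-lam, lam], [mu, -mu]]
print(availability(A, 0, 1.0, {0}), failure_free(A, 0, 1.0, {0}),
      mttf(A, 0, {0}, horizon=60.0))
```

For a two-state chain the three quantities can be compared against the familiar
closed forms, which is a convenient check before moving to larger state spaces.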
Section 4.0


APPLICATION OF METHODOLOGY TO SIMPLE RELIABILITY CONSTRUCTS


In this section some criteria will be presented for selecting models for the
SW process {X(t), t > 0} and these criteria will be used to show how the
calculation of (3.2-2) can be accomplished.

A particular Markov model for the SW process, namely the Goel/Okumoto
SW process described in Appendix C, will be adopted for the general combined
HW/SW reliability model (3.2-2). With this adoption the model will be analyzed,
and expressions for the state occupancy probabilities and the reliability measures
defined in Section 3.3 will be derived. Computational/numerical aspects of
employing the baseline model will be discussed and several specific reliability
constructs will be analyzed using the baseline model. Finally, the possibilities
for employing other processes for the SW model process (not necessarily those
satisfying the criteria discussed earlier) will be discussed.

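Employing the basic assumptions of the Goel/Okumoto model in the SW
process X(t), therefore, the general HW/SW model (3.2-2) becomes

[Equation (4.1-1), giving $P\{Y(t) = n \mid Y(0) = i\}$, is garbled in the source. The
legible parts are a leading term $e^{-Nct}\,p_N(t,i,n)$ and a triple sum over j, $\ell$
and m, weighted by binomial coefficients, of integrals of the form
$\int_0^t p_j(s,0,n)\,e^{-c(j-\ell-m)s}\,ds$.]                                   (4.1-1)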


where the process Y(t), as before, takes values in the system state space
{0, 1, ..., J} with Y(0) = i (i.e., the system starts in state i), and the process
X(t) takes values in {0, 1, ..., N} with X(0) = N (i.e., the SW initially
contains N faults). The parameter $c = \lambda p$ is the rate of fault correction,
where p is the probability of a "perfect" debug and $\lambda$ is the rate of maintenance
troubleshooting.

The detailed development of (4.1-1) from (3.2-2) using the Goel/Okumoto
SW model is provided in Appendix D.

In  the  subsections  below,  (4.1-1)  will  be  applied  to  some  simple  reliability 
configurations  (or  constructs)  which  involve  series  systems  and  redundant  systems 
that  are  replications  of  a  single  unit.  Figure  4.1-1  gives  several  examples  of 
these  simple  reliability  constructs.  Complex  reliability  constructs  (series- 
parallel  configurations)  are  considered  in  Section  5.0. 

4.1  ANALYSIS  OF  SERIES  CONSTRUCTS 

In  the  series  examples  detailed  below  the  solutions  to  the  Kolmogorov 
differential  equations  can  be  obtained  in  closed-form.  Although  in  general  it 
will  be  necessary  to  compute  the  reliability  measures  on  a  computer,  for  these 
cases  it  will  not  be  necessary  to  numerically  solve  systems  of  differential 
equations. 


Figure 4.1-1. Examples of Simple Reliability Constructs.

4.1.1 Two State Series Case

In this example it will be assumed that there is one HW unit and one SW
"unit" in a series configuration. The SW failure rate will be $\phi j$ when there are
j faults in the SW, i.e., the SW failure rate will be $\phi X(t)$. The HW failure rate
(plus any constant component of SW failure rate) will be $\lambda_H$. For this example,
we will not distinguish between the different failed states (i.e., HW down, SW
down) but will assume that the rate of repair from the failed state is $\mu$,
regardless of the failure type. The transition diagram (when X(t) = j) is given by

[Transition diagram: state 0 (FULL-UP) and state 1 (FAILURE); the transition
0 → 1 occurs at rate $\lambda_H + \phi j$ and the transition 1 → 0 at rate $\mu$.]

In this case, there are two distinct system states as indicated in the transition
diagram (i.e., Y(t) takes values in {0, 1}). The infinitesimal matrix associated
with this transition diagram is

$$A_j = \begin{pmatrix} -(\lambda_H + \phi j) & (\lambda_H + \phi j) \\ \mu & -\mu \end{pmatrix}.$$

The Kolmogorov differential equations are (cf. (B.1-7)):

$$\frac{d}{dt}\,p_j(t,0,0) = -(\lambda_H + \phi j)\,p_j(t,0,0) + \mu\,p_j(t,0,1)$$

$$\frac{d}{dt}\,p_j(t,0,1) = -\mu\,p_j(t,0,1) + (\lambda_H + \phi j)\,p_j(t,0,0)$$

with initial conditions $p_j(0,0,0) = 1$, $p_j(0,0,1) = 0$. The solutions for t > 0 are

$$p_j(t,0,0) = \frac{\mu}{\lambda_H + \phi j + \mu} + \frac{(\lambda_H + \phi j)\,e^{-(\lambda_H + \phi j + \mu)t}}{\lambda_H + \phi j + \mu}$$

$$p_j(t,0,1) = \frac{\lambda_H + \phi j}{\lambda_H + \phi j + \mu} - \frac{(\lambda_H + \phi j)\,e^{-(\lambda_H + \phi j + \mu)t}}{\lambda_H + \phi j + \mu}$$


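The integration in (4.1-1) is easily performed to yield

$$\int_0^t p_j(s,0,0)\,e^{-c(j-\ell-m)s}\,ds = \frac{\mu\left(1 - e^{-c(j-\ell-m)t}\right)}{c(j-\ell-m)(\lambda_H + \phi j + \mu)} + \frac{(\lambda_H + \phi j)\left(1 - e^{-[\lambda_H + \phi j + \mu + c(j-\ell-m)]t}\right)}{(\lambda_H + \phi j + \mu)[\lambda_H + \phi j + \mu + c(j-\ell-m)]}$$

$$\int_0^t p_j(s,0,1)\,e^{-c(j-\ell-m)s}\,ds = \frac{(\lambda_H + \phi j)\left(1 - e^{-c(j-\ell-m)t}\right)}{c(j-\ell-m)(\lambda_H + \phi j + \mu)} - \frac{(\lambda_H + \phi j)\left(1 - e^{-[\lambda_H + \phi j + \mu + c(j-\ell-m)]t}\right)}{(\lambda_H + \phi j + \mu)[\lambda_H + \phi j + \mu + c(j-\ell-m)]} \tag{4.1-2}$$

In these expressions, it should be noted that

$$\frac{1 - e^{-at}}{a} = t \quad \text{when } a = 0.$$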
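The closed-form integrals of the two-state solutions against an exponential
weight can be checked against numerical quadrature. The Python sketch below does
this under stated assumptions: `fail` stands for $\lambda_H + \phi j$, `a` stands for
the exponent constant $c(j-\ell-m)$, and the function names and parameter values
are hypothetical choices made only for this check.

```python
import math

def p_up(t, fail, mu):
    """p_j(t,0,0) for the two-state chain, with fail = lambda_H + phi*j."""
    k = fail + mu
    return mu / k + fail / k * math.exp(-k * t)

def one_minus_exp_over(a, t):
    """(1 - exp(-a t)) / a, taken as t when a = 0."""
    return t if a == 0.0 else -math.expm1(-a * t) / a

def integral_up(t, fail, mu, a):
    """Closed form of the integral of p_j(s,0,0) exp(-a s) over [0, t]."""
    k = fail + mu
    return ((mu / k) * one_minus_exp_over(a, t)
            + (fail / k) * one_minus_exp_over(k + a, t))

def integral_down(t, fail, mu, a):
    """Closed form of the integral of p_j(s,0,1) exp(-a s) over [0, t]."""
    k = fail + mu
    return (fail / k) * (one_minus_exp_over(a, t)
                         - one_minus_exp_over(k + a, t))

def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (hi - lo) / n
    s = f(lo) + f(hi)
    s += 4.0 * sum(f(lo + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
    s += 2.0 * sum(f(lo + 2 * i * h) for i in range(1, n // 2))
    return s * h / 3.0

fail, mu, t = 0.004 + 1.0 * 3, 2.0, 4.0      # hypothetical: phi = 1, j = 3
for a in (1.9, 0.0, -0.95):                  # includes the special case a = 0
    numeric = simpson(lambda s: p_up(s, fail, mu) * math.exp(-a * s), 0.0, t)
    print(a, integral_up(t, fail, mu, a), numeric)
```

Handling a = 0 through a dedicated helper mirrors the note in the text and
avoids a division by zero when $j - \ell - m = 0$.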

Substituting these expressions into (4.1-1) with i = 0 and a fixed value for N
will give expressions for P{Y(t) = q | Y(0) = 0}, q = 0, 1, which are easily
programmed on a computer. Alternatively, the computer program in Appendix F
could be used. Since there is only one operational state (state 0), the
availability is

$$A(t) = P\{Y(t) = 0 \mid Y(0) = 0\}.$$

To compute MTTF, $P(t_0)$, and $R(\infty, t_0)$ it is necessary to make state $1 = i_F$
absorbing, in which case the transition diagram is now (when X(t) = j):

[Transition diagram: state 0 (FULL-UP) goes to the absorbing state 1 (FAILURE)
at rate $\lambda_H + \phi j$; there is no return transition.]

The infinitesimal matrix for this diagram is

$$A_j = \begin{pmatrix} -(\lambda_H + \phi j) & (\lambda_H + \phi j) \\ 0 & 0 \end{pmatrix}$$

and the Kolmogorov equations are then

$$\frac{d}{dt}\,p_j(t,0,0) = -(\lambda_H + \phi j)\,p_j(t,0,0)$$

$$\frac{d}{dt}\,p_j(t,0,1) = (\lambda_H + \phi j)\,p_j(t,0,0)$$

with initial conditions as before. The solutions are

$$p_j(t,0,0) = e^{-(\lambda_H + \phi j)t}$$

$$p_j(t,0,1) = 1 - e^{-(\lambda_H + \phi j)t} \tag{4.1-3}$$


Using (4.1-2) it is seen that

$$\lim_{t \to \infty} p_0(t,0,0) = \pi(0) = \frac{\mu}{\lambda_H + \mu} = \lim_{t \to \infty} P\{Y(t) = 0\}$$

and

$$\lim_{t \to \infty} p_0(t,0,1) = \pi(1) = \frac{\lambda_H}{\lambda_H + \mu} = \lim_{t \to \infty} P\{Y(t) = 1\}$$

which are the same results which would be obtained if there were no SW in the
system.

Using (4.1-3) and since there is only one operational state, the steady state
reliability coefficient is

$$R(\infty, t_0) = \frac{\mu\,e^{-\lambda_H t_0}}{\mu + \lambda_H}.$$
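The step from (3.3-6) to the steady-state reliability coefficient can be spelled
out as a short worked equation. The sketch assumes, as the text indicates, that
OP = {0} and that "steady-state" conditions correspond to no remaining SW faults
(j = 0); the values $\lambda_H = 0.004$, $\mu = 2$ and $t_0 = 10$ are illustrative
assumptions used only for the arithmetic.

```latex
\begin{align*}
R(\infty, t_0) &= \sum_{m \in OP} P_m(t_0)\,\pi(m) = P_0(t_0)\,\pi(0) \\
P_0(t_0) &= e^{-\lambda_H t_0}
  \quad \text{(from (4.1-3) with } j = 0\text{)} \\
\pi(0) &= \frac{\mu}{\lambda_H + \mu} \\
R(\infty, t_0) &= \frac{\mu}{\lambda_H + \mu}\,e^{-\lambda_H t_0}
  \approx \frac{2}{2.004}\,e^{-0.04}
  \approx 0.9980 \times 0.9608 \approx 0.9589
\end{align*}
```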


The remaining measures are computed by using

$$1 - p_j(u,0,i_F) = e^{-(\lambda_H + \phi j)u}$$

in (D.2-14) and then employing (3.3-2), (3.3-3), and (3.3-4).

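4.1.2 Three State Series Case

In this example, there is again one HW unit and one SW unit in a series
configuration. In this case the failed states will be: HW down, SW operational
($\bar{H}S$); SW down, HW operational ($H\bar{S}$). The full-up state will be denoted by HS.
The transition diagram (when X(t) = j) is given by:

[Transition diagram: state 0 (HS) goes to state 1 ($H\bar{S}$) at rate $\phi j$ and to
state 2 ($\bar{H}S$) at rate $\lambda_H$; states 1 and 2 return to state 0 at rates $\mu_S$
and $\mu_H$, respectively.]

The infinitesimal matrix associated with this diagram is

$$A_j = \begin{pmatrix} -(\phi j + \lambda_H) & \phi j & \lambda_H \\ \mu_S & -\mu_S & 0 \\ \mu_H & 0 & -\mu_H \end{pmatrix}.$$

The Kolmogorov differential equations are

$$\frac{d}{dt}\,p_j(t,0,0) = -(\phi j + \lambda_H)\,p_j(t,0,0) + \mu_S\,p_j(t,0,1) + \mu_H\,p_j(t,0,2)$$

$$\frac{d}{dt}\,p_j(t,0,1) = \phi j\,p_j(t,0,0) - \mu_S\,p_j(t,0,1)$$

$$\frac{d}{dt}\,p_j(t,0,2) = \lambda_H\,p_j(t,0,0) - \mu_H\,p_j(t,0,2)$$

Then the solutions to the differential equations are

$$p_j(t,0,0) = d_j + C_{1j}\,e^{r_{1j}t} + C_{2j}\,e^{r_{2j}t}$$

$$p_j(t,0,1) = \frac{\phi j\,d_j}{\mu_S}\left(1 - e^{-\mu_S t}\right) + \frac{\phi j\,C_{1j}}{\mu_S + r_{1j}}\left(e^{r_{1j}t} - e^{-\mu_S t}\right) + \frac{\phi j\,C_{2j}}{\mu_S + r_{2j}}\left(e^{r_{2j}t} - e^{-\mu_S t}\right)$$

$$p_j(t,0,2) = \frac{\lambda_H\,d_j}{\mu_H}\left(1 - e^{-\mu_H t}\right) + \frac{\lambda_H\,C_{1j}}{\mu_H + r_{1j}}\left(e^{r_{1j}t} - e^{-\mu_H t}\right) + \frac{\lambda_H\,C_{2j}}{\mu_H + r_{2j}}\left(e^{r_{2j}t} - e^{-\mu_H t}\right) \tag{4.1-4}$$

with the provision that

$$\frac{e^{at} - e^{bt}}{a - b} = t\,e^{at} \quad \text{when } a = b.$$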

The  integrations  in  (4.1-1)  are  straightforward  but  will  not  be  exhibited  because 
the  notation  is  cumbersome.  Nevertheless,  (4.1-1)  is  now  easily  programmed  on 
a  computer  to  yield  the  state-occupancy  probabilities  and  availability. 

By collapsing the failed states into one absorbing failed state $1 = i_F$ the
state diagram becomes:

[State diagram: state 0 (HS) goes to the single absorbing failed state $1 = i_F$
at rate $\phi j + \lambda_H$.]

and the solution to the Kolmogorov equations in this situation is identical to
(4.1-3) with [condition illegible in source], so that the last comment of
Section 4.1.1 applies here also. Letting $t \to \infty$ in (4.1-4) gives

$$\lim_{t \to \infty} p_0(t,0,0) = \pi(0) = \frac{\mu_H}{\lambda_H + \mu_H} = \lim_{t \to \infty} P\{Y(t) = 0\}$$

$$\lim_{t \to \infty} p_0(t,0,1) = \pi(1) = 0 = \lim_{t \to \infty} P\{Y(t) = 1\}$$

$$\lim_{t \to \infty} p_0(t,0,2) = \pi(2) = \frac{\lambda_H}{\lambda_H + \mu_H} = \lim_{t \to \infty} P\{Y(t) = 2\}.$$

The steady-state reliability coefficient is therefore given by

$$R(\infty, t_0) = \frac{\mu_H}{\lambda_H + \mu_H}\,e^{-\lambda_H t_0}.$$

The expressions (4.1-4) were used in (4.1-1) with

    $\phi = 1$
    $N = 5$
    $\mu_S = \mu_H = 2$
    $\lambda_H = 0.004$
    $c = \lambda p = 0.95$

and programmed on a computer (a description of the program is provided in
Appendix F). A plot of availability as a function of time is pictured in
Figure 4.1-2. The steady state value is (since there is only one operational state)
$\mu_H/(\lambda_H + \mu_H) = 2/2.004 = 0.9980$.

The severe "undershoot" of availability beneath the steady-state value is
noteworthy. This behavior is typical whenever SW failures occur often enough to be
important without being removed very quickly. Figure 4.1-3 shows additional
plots for combinations of $\phi$ (affecting the SW failure rate) and $c = \lambda p$
(affecting the speed of fault correction). As $\phi$ decreases (c fixed) the
undershoot becomes less severe and eventually non-existent. Of course, a decrease
in $\phi$ means that the faults existing in the SW do not manifest themselves often,
so that in the time it takes to remove the faults, the SW does not fail often
enough to cause severe undershoot. The constant c determines, roughly, the rate
at which faults are removed from the SW. When c is small, it causes the rise from
the undershoot to be very long and slow, thus postponing steady-state conditions
greatly.

Most  importantly,  these  considerations  suggest  that  instead  of  using  the 
traditional  steady-state  availability  as  a  single  measure  of  effectiveness  it 
would  be  better  to  consider  the  minimum  value  of  availability,  since  each  graph 
in  Figures  4.1-2  and  4.1-3  possesses  the  same  steady-state  value. 
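The undershoot and the case for tracking minimum availability can be explored
with a small numerical experiment. The Python sketch below is not the report's
equation (4.1-1) or the Appendix F program: it integrates a joint Markov chain
on (X, Y) under the explicit assumptions that a fault is corrected at rate c·j
regardless of the system state and that Y keeps its state when a fault is
removed. Function names, the state indexing and the Runge-Kutta step are
hypothetical choices, so the curve need not match Figure 4.1-2 in detail.

```python
def joint_generator(n_faults, phi, lam_h, mu_s, mu_h, c):
    """Generator of the joint chain; state index = 3*j + y, y in {0, 1, 2}.

    ASSUMPTION (not from the report): X drops from j to j-1 at rate c*j
    whatever the value of Y, and Y is unchanged by that jump.
    """
    n = 3 * (n_faults + 1)
    Q = [[0.0] * n for _ in range(n)]
    for j in range(n_faults + 1):
        up, sw_down, hw_down = 3 * j, 3 * j + 1, 3 * j + 2
        Q[up][sw_down] = phi * j          # SW failure
        Q[up][hw_down] = lam_h            # HW failure
        Q[sw_down][up] = mu_s             # SW restoration
        Q[hw_down][up] = mu_h             # HW repair
        if j > 0:
            for y in range(3):
                Q[3 * j + y][3 * (j - 1) + y] = c * j   # fault correction
    for x in range(n):
        Q[x][x] = -sum(Q[x])              # diagonal is still zero here
    return Q

def availability_curve(Q, start, t_end, dt=0.01):
    """Integrate p' = pQ with classical RK4; return [(t, A(t)), ...]."""
    n = len(Q)
    rows = [[(y, Q[x][y]) for y in range(n) if Q[x][y] != 0.0] for x in range(n)]

    def deriv(p):
        d = [0.0] * n
        for x, px in enumerate(p):
            for y, rate in rows[x]:
                d[y] += px * rate
        return d

    p = [0.0] * n
    p[start] = 1.0
    curve = [(0.0, sum(p[0::3]))]         # states with y = 0 are operational
    for k in range(1, int(round(t_end / dt)) + 1):
        k1 = deriv(p)
        k2 = deriv([a + 0.5 * dt * b for a, b in zip(p, k1)])
        k3 = deriv([a + 0.5 * dt * b for a, b in zip(p, k2)])
        k4 = deriv([a + dt * b for a, b in zip(p, k3)])
        p = [a + dt / 6.0 * (b + 2 * e + 2 * f + g)
             for a, b, e, f, g in zip(p, k1, k2, k3, k4)]
        curve.append((k * dt, sum(p[0::3])))
    return curve

# Parameter values quoted in the text for the three-state series example.
Q = joint_generator(n_faults=5, phi=1.0, lam_h=0.004, mu_s=2.0, mu_h=2.0, c=0.95)
curve = availability_curve(Q, start=3 * 5, t_end=20.0)
t_min, a_min = min(curve, key=lambda point: point[1])
print("minimum availability", a_min, "at t =", t_min, "final value", curve[-1][1])
```

Reporting the minimum alongside the final value makes the point of the
paragraph concrete: two parameter sets can share a steady-state value while
differing widely in how far availability dips on the way there.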

4.2  ANALYSIS  OF  CONFIGURATIONS  WITH  REDUNDANT  HARDWARE 

In  this  section  a  configuration  with  redundant  HW  units  interacting  with  SW 
will  be  discussed.  A  general  transition  diagram  and  infinitesimal  matrix  will 
be  described  and  some  examples  will  be  analyzed  using  the  computer  program 
documented  in  Appendix  F.  Modifications  to  incorporate  various  maintenance 
scenarios  will  also  be  discussed. 

4.2.1  Transition  Diagrams  and  Infinitesimal  Matrices 

For this example it is assumed that there are $N_H$ identical HW units, M of
which are required for operation of the HW portion of the system. For now, it
will be assumed that the HW standby units are operational (i.e., hot standby),
and that each HW unit is accessing the SW equally. It will also be assumed that
only one HW unit can be repaired at a time, and that additional HW units do not
fail while the SW is down (e.g., the SW is being patched). The transition
diagram corresponding to this scenario (when X(t) = j) is shown in Figure 4.2-1.

In Figure 4.2-1, states are assigned integer values and labeled according
to the condition of the system. For example, $H_kS$ is the state in which there are
k HW units down but the SW is operational. Similarly, $H_k\bar{S}$ is the state in which
there are k HW units down and the SW is down. The states $H_k\bar{S}$, k = 0, 1, ...,
$N_H - M$, are system failed states, while the states $H_kS$, $k \le N_H - M$, are
operational states, although for $k \ge 1$ they may be degraded operational states.

Notice that the rate of transition from $H_kS$ to $H_k\bar{S}$ is $\phi(N_H - k)j$ since there
are k fewer HW units accessing the SW. If it is desired to have unlimited repair
of the HW (i.e., unlimited repair resources), then the value of $\mu_H$ must be
multiplied by suitable constants depending on the number of HW units failed
at each transition. Notice that in this case, we are tacitly assuming that all of
the HW units are accessing the same SW unit, rather than employing redundant
SW units.

The infinitesimal matrix associated with Figure 4.2-1 is easily written
down for specific cases but is cumbersome to express in general. However, it
is clear what the entries should be from the figure. For example, denoting
$A_j(x,y)$ to be the (x,y) entry in matrix $A_j$ (cf. B.1-8):

$$A_j(0,1) = \phi N_H j, \quad A_j(0,2) = N_H\lambda_H, \quad A_j(0,0) = -(\phi N_H j + N_H\lambda_H), \quad A_j(0,k) = 0 \text{ for } k > 2;$$

$$A_j(1,0) = \mu_S, \quad A_j(1,1) = -\mu_S, \quad A_j(1,k) = 0 \text{ for } k > 1;$$

$$A_j(2,0) = \mu_H, \quad A_j(2,1) = 0, \quad A_j(2,3) = \phi(N_H - 1)j, \quad A_j(2,4) = (N_H - 1)\lambda_H,$$

$$A_j(2,2) = -[\mu_H + \phi(N_H - 1)j + (N_H - 1)\lambda_H], \quad A_j(2,k) = 0 \text{ for } k > 4,$$

etc.

Figure 4.2-1. Transition Diagram for Configuration with Redundant HW

In general, to determine the value of $A_j(x,y)$ for fixed x, refer to the
state x in the transition diagram. If $y \ne x$ and x has an arrow pointing away
from it to y, then $A_j(x,y)$ is equal to the transition rate associated with this
arrow. If there is no arrow, $A_j(x,y) = 0$. The value of $A_j(x,x)$ is then

$$A_j(x,x) = -\sum_{y \ne x} A_j(x,y).$$
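The rule for filling in $A_j$ from the transition diagram translates directly
into a short routine. The Python sketch below assumes the state numbering
suggested by the entries listed above (even states: SW up; odd states: SW down;
last state: HW portion failed) and the single-repair, hot-standby scenario. The
function name and argument names are hypothetical.

```python
def redundant_hw_generator(j, n_h, m, phi, lam_h, mu_s, mu_h):
    """A_j for N_H hot-standby HW units (M required) sharing one SW unit.

    State numbering (an assumption read off the entries in the text):
      2k     -> k HW units down, SW up     (k = 0 .. N_H - M)
      2k + 1 -> k HW units down, SW down
      last   -> N_H - M + 1 HW units down  (HW portion failed)
    """
    spares = n_h - m
    n = 2 * spares + 3
    A = [[0.0] * n for _ in range(n)]
    for k in range(spares + 1):
        up, sw_down = 2 * k, 2 * k + 1
        A[up][sw_down] = phi * (n_h - k) * j   # SW failure, n_h - k units access SW
        A[sw_down][up] = mu_s                  # SW restoration
        A[up][up + 2] = (n_h - k) * lam_h      # one more HW unit fails
        if k > 0:
            A[up][up - 2] = mu_h               # one HW unit repaired at a time
    A[n - 1][n - 3] = mu_h                     # repair out of the HW failed state
    for x in range(n):
        A[x][x] = -sum(A[x])                   # diagonal is still zero here
    return A

# Hypothetical call with N_H = 10, M = 9 and j = 3 faults present.
for row in redundant_hw_generator(3, n_h=10, m=9, phi=0.001,
                                  lam_h=0.002, mu_s=1.0, mu_h=2.0):
    print(row)
```

Setting the diagonal last, as the negative of the row sum, is the code
counterpart of the final formula for $A_j(x,x)$.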


4.2.2  Examples 


This first example has (using the notation of Section 4.2.1) M = 9, $N_H = 10$,
$\phi = 0.001$, $\mu_S = 1$, $\lambda_H = 0.002$, $\mu_H = 2$. The parameters for the software
process X(t) are $c\ (= \lambda p) = 0.95$ and N = 10 SW bugs present initially.

The transition diagram (when X(t) = j) is:

[Transition diagram: the five-state case of Figure 4.2-1, with states 0 through 4.]

The infinitesimal matrix corresponding to this diagram is

$$A_j = \begin{pmatrix}
-(0.01j + 0.02) & 0.01j & 0.02 & 0 & 0 \\
1 & -1 & 0 & 0 & 0 \\
2 & 0 & -(2.018 + 0.009j) & 0.009j & 0.018 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 2 & 0 & -2
\end{pmatrix}$$

The computer program was used to compute the probabilities (using
Equation 4.1-1)

$$P\{Y(t) = i \mid Y(0) = 0\}, \quad i = 0, 1, 2, 3, 4$$

for t = 0, 1, ..., 15 hours and the results are reproduced in Table 4.2-1.

The steady-state probabilities are

    $\pi(0)$ = 0.99001079
    $\pi(1)$ = 0
    $\pi(2)$ = 0.00990011
    $\pi(3)$ = 0
    $\pi(4)$ = 0.00008910

which are derived by solving the equations (cf. Equations D.2-15)

$$\sum_{x=0}^{4} \pi(x)\,A_0(x,y) = 0, \quad y = 0, 1, \ldots, 4$$

$$\sum_{x=0}^{4} \pi(x) = 1.$$

The "successful" operational states are {0, 2} so that the availability A(t) is
computed by adding the probabilities for states 0 and 2 for each t. The
availability for t = 0, 1, 2, ..., 8 hours is shown in Table 4.2-2.
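The steady-state equations can be solved with a few lines of linear algebra. The
Python sketch below uses plain Gaussian elimination, replacing one balance
equation by the normalization condition; the matrix is $A_0$ of the five-state
example (j = 0), and the function name is a hypothetical choice.

```python
def steady_state(A):
    """Solve pi A = 0 together with sum(pi) = 1 by Gaussian elimination."""
    n = len(A)
    # Row y of M holds the balance equation sum_x pi(x) A(x, y) = 0;
    # the last balance equation is replaced by the normalization condition.
    M = [[A[x][y] for x in range(n)] + [0.0] for y in range(n)]
    M[n - 1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[r][n] / M[r][r] for r in range(n)]

# A_0 for the five-state example (all SW faults removed, j = 0).
A0 = [
    [-0.02, 0.0,  0.02,   0.0, 0.0],
    [ 1.0, -1.0,  0.0,    0.0, 0.0],
    [ 2.0,  0.0, -2.018,  0.0, 0.018],
    [ 0.0,  0.0,  1.0,   -1.0, 0.0],
    [ 0.0,  0.0,  2.0,    0.0, -2.0],
]
pi = steady_state(A0)
print(pi, "steady-state availability:", pi[0] + pi[2])
```

Summing the entries for the operational states {0, 2} gives the steady-state
availability that the transient values approach for large t.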

The next example has (using the notation of Section 4.2.1) M = 8, $N_H = 10$,
$\phi = 0.005$, $\mu_S = 1$, $\lambda_H = 0.002$, $\mu_H = 1$. The parameters for the SW process
are $c\ (= \lambda p) = 0.95$, and N = 10 SW bugs initially. The transition diagram
(when X(t) = j) is:

[Transition diagram: the seven-state case of Figure 4.2-1, with states 0 through 6.]

TABLE 4.2-1. STATE OCCUPANCY PROBABILITIES FOR THE FIVE STATE EXAMPLE

TABLE 4.2-2. A(t) FOR THE FIVE-STATE EXAMPLE

    t (hours)    A(t)
    0            1.0000
    1            0.9927
    2            0.9948
    3            0.9969
    4            0.9984
    5            0.9992
    6            0.9996
    7            0.9998
    8            0.9999

    Steady-state availability is $\pi(0) + \pi(2)$ = 0.9999109

The difference between this example and the previous example, besides different
parameter values, is that here there is one more redundant HW unit (i.e.,
only 8 of 10 are required whereas before, 9 of 10 were required).

The infinitesimal matrix corresponding to this diagram is

$$A_j = \begin{pmatrix}
-(0.05j + 0.02) & 0.05j & 0.02 & 0 & 0 & 0 & 0 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & -(1.018 + 0.045j) & 0.045j & 0.018 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & -(1.016 + 0.04j) & 0.04j & 0.016 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & -1
\end{pmatrix}$$

The successful operating states for this configuration are {0, 2, 4}.

The state occupancy probabilities are shown in Table 4.2-3, and the
availability, A(t), is given in Table 4.2-4 for t = 1, 2, ..., 9 hours.

The parameter values for these examples have been selected more or less
arbitrarily to illustrate the combined HW/SW reliability methodology. Due to
the large number of entry variables for this type of model, any type of
generalized table of availabilities is clearly impractical to generate.

4.2.3  Modifications 

The redundancy model of Figure 4.2-1 can be modified in many ways to
incorporate different maintenance and/or operation scenarios. One such
modification, which will be discussed in more detail in Sections 5.3 and 5.4, is
providing for simultaneous repair of more than one failed HW unit at a time.

As it stands, Figure 4.2-1 assumes that only one HW unit at a time is
repaired. To accommodate "unlimited" repair, it is necessary to multiply $\mu_H$ at
each transition accordingly. For example, the rate going from state 2 to
state 0 remains at $\mu_H$, while the rate going from 4 to 2 becomes $2\mu_H$ since
state 4 entails two failed HW units with repair occurring simultaneously. The
other $\mu_H$'s are revised similarly. In addition, there is the possibility of
making the SW repair rates $\mu_S$ dependent on other factors. For example,
devoting maintenance personnel to HW repairs may take resources and manpower
away from SW repair efforts so that it may be desired to reduce the SW repair
rate depending on how many HW units are undergoing repair.

TABLE 4.2-3. STATE OCCUPANCY PROBABILITIES FOR THE SEVEN STATE EXAMPLE

Another desirable modification is to allow for cold standby HW units. In this
case the transition rates to go from 0 to 2, 2 to 4, etc., will be changed to
$M\lambda_H$ since at any point in time prior to the HW down state, there are M units
operating. Also, since cold units cannot access the SW, the transition rates
leading to SW failed states will be changed to $\phi M j$.
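The effect of these modifications on the HW side can be previewed at steady
state. The Python sketch below assumes that all SW faults have been removed
(j = 0), so that the SW-down states carry no probability and the number of
failed HW units forms a birth-death chain; the function name, the flags and the
product-form solution are illustrative choices rather than the report's
procedure.

```python
def hw_steady_state(n_h, m, lam_h, mu_h, unlimited_repair=False,
                    cold_standby=False):
    """Steady-state distribution of the number of failed HW units.

    ASSUMPTION: j = 0 (no SW faults left), so only the HW birth-death
    chain on k = 0 .. N_H - M + 1 failed units matters.
    """
    weights = [1.0]
    for k in range(n_h - m + 1):
        working = m if cold_standby else n_h - k    # units exposed to failure
        crews = k + 1 if unlimited_repair else 1    # repairs running in parallel
        # Balance between level k and level k + 1.
        weights.append(weights[-1] * working * lam_h / (crews * mu_h))
    total = sum(weights)
    return [w / total for w in weights]

# Seven-state example: M = 8 of N_H = 10, lambda_H = 0.002, mu_H = 1.
base = hw_steady_state(10, 8, 0.002, 1.0)
unlimited = hw_steady_state(10, 8, 0.002, 1.0, unlimited_repair=True)
cold = hw_steady_state(10, 8, 0.002, 1.0, cold_standby=True)
for name, pi in (("single repair", base), ("unlimited repair", unlimited),
                 ("cold standby", cold)):
    print(name, "steady-state availability:", 1.0 - pi[-1])
```

Only the failure and repair multipliers change between scenarios, which is the
same bookkeeping the text describes for the arrows of Figure 4.2-1.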


TABLE 4.2-4. A(t) FOR THE SEVEN STATE EXAMPLE*

    t (hours)    A(t)
    0            1.0000
    1            0.9647
    2            0.9749
    3            0.9850
    4            0.9924
    5            0.9965
    6            0.9985
    7            0.9994
    8            0.9998
    9            0.9999

Finally, it is neither necessary nor always desirable to include all HW failed
states in one down state. Figure 4.2-1 can be modified so that a new state for
each number of failed HW units is added. If, for example, HW continues to
operate while the system is down, additional down states $H_kS$ and $H_k\bar{S}$,
$k = N_H - M + 1, N_H - M + 2, \ldots, N_H$, can be added so that the probabilities
of being in these states can be computed.

4.3  SW  REDUNDANCY 

The concepts involved in SW redundancy are not entirely analogous to
those in HW redundancy. For example, there would be no use in employing
identical SW modules for back-up purposes since identical SW will produce
identical SW failures. Also, there is no such thing as SW "hot" standby since
routines cannot generally execute simultaneously on the same computer. If
there were redundant computers and memories in the HW portion of the
system, then the concept of "hot" SW standby would be meaningful but hardly
useful because of what was stated before: identical SW modules exhibit identical
bugs.


*The steady state availability is $\pi(0) + \pi(2) + \pi(4)$ = 0.999994. Equations
D.2-15 are used to compute $\pi(0)$, $\pi(2)$, and $\pi(4)$.

One useful type of SW redundancy is the use of back-up SW routines which
are entered when error conditions are encountered in the executing routine.
Such back-up routines would perform alternate operations which do not raise
error conditions and which allow completion of the task in an adequate, although
possibly degraded, fashion. This type of SW redundancy belongs in the realm
of fault-tolerant computing and to model it would require application of the
methodology in this study at the module or routine level in the SW. A separate
process X(t) would be needed to keep track of the number of faults in each SW
routine at time t. Since it is not likely that there are independent maintenance
teams devoted to each SW module or routine, the various processes X(t) would be
correlated and the problem would be intractable. This type of internal SW
redundancy is best handled by reducing the constant of proportionality used in
computing the SW failure rates (e.g., reduce $\phi$).

4.4  THE  USE  OF  OTHER  SW  MODELS 

The selection of a Markov model for X(t) was made for reasons of
tractability and because it had desirable properties (e.g., Markov structure,
$X(t) \to 0$ as $t \to \infty$, allowance for imperfect debugging, etc.). There is no reason
why other models could not be adopted provided of course that X(t) takes
non-negative integer values and provided X(t) is a step function. In removing
the Markov structure associated with X(t) the probabilistic analysis of
the sample paths becomes exceedingly difficult so it would be advisable (and
reasonable) to adopt such structure. Outside of this necessity, most other
criteria may be dropped if so desired. In addition, there is no reason why
the SW failure rate need be directly proportional to X(t) (cf. Section 4.1.1,
for example). Other dependencies including polynomial and exponential could
be used.

If a different process for X(t) is adopted, simulation techniques may be
necessary in order to compute state occupancy probabilities. This procedure
is straightforward if Markov structure is maintained for X(t). Such a
simulation would entail generation of sample paths of Y(t) and computing statistics
relating to desired measures. The procedure for generating a sample path for
Y(t) begins by generating the jump-times associated with X(t). The
value X(t) takes in each interval between jumps is determined from the associated
parameters for X(t) (described in Appendix B). The value of X(t) = j in turn
determines the matrix $A_j$ which describes the evolution of Y(t) during the interval
when X(t) = j. It should be pointed out that in many cases, the probabilities being
estimated in such a simulation are likely to be very close to 1 or 0 so that the
number of replications required will be large, resulting in large computer time
expenditures.
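The sample-path procedure just described can be sketched in Python as follows.
The sketch assumes that X(t) loses one fault at rate c·j independently of Y(t)
and that Y(t) keeps its state when X(t) jumps; the function names, the
dictionary representation of $A_j$ and the three-state series rates used in the
demonstration are hypothetical choices.

```python
import random

def sample_y(t_end, n_faults, c, rates_for, rng, start=0):
    """Return Y(t_end) for one simulated sample path.

    rates_for(j) -> {state: [(next_state, rate), ...]} plays the role of A_j.
    ASSUMPTION: X drops from j to j-1 at rate c*j independently of Y.
    """
    # Step 1: jump times of X(t); the holding time at level j is exponential.
    jump_times, t = [], 0.0
    for j in range(n_faults, 0, -1):
        t += rng.expovariate(c * j)
        jump_times.append(t)
    # Step 2: evolve Y(t) interval by interval, using A_j while X(t) = j.
    y, now, j = start, 0.0, n_faults
    for boundary in jump_times + [float("inf")]:
        stop = min(boundary, t_end)
        table = rates_for(j)
        while now < stop:
            moves = [(nxt, r) for nxt, r in table.get(y, []) if r > 0.0]
            total = sum(r for _, r in moves)
            hold = rng.expovariate(total) if total > 0.0 else float("inf")
            if now + hold >= stop:
                now = stop            # memoryless: restart in the next interval
                break
            now += hold
            u, acc = rng.random() * total, 0.0
            for nxt, r in moves:
                acc += r
                if u < acc:
                    break
            y = nxt                   # last move is the fallback on round-off
        if stop >= t_end:
            return y
        j -= 1
    return y

def estimate_availability(t_end, n_faults, c, rates_for, op, reps, rng):
    """Fraction of replications with Y(t_end) in the operating set OP."""
    hits = sum(sample_y(t_end, n_faults, c, rates_for, rng) in op
               for _ in range(reps))
    return hits / reps

def series_rates(j, phi=1.0, lam_h=0.004, mu_s=2.0, mu_h=2.0):
    """Three-state series construct: 0 = full-up, 1 = SW down, 2 = HW down."""
    return {0: [(1, phi * j), (2, lam_h)], 1: [(0, mu_s)], 2: [(0, mu_h)]}

rng = random.Random(12345)
print(estimate_availability(0.5, 5, 0.95, series_rates, {0}, 2000, rng))
print(estimate_availability(20.0, 5, 0.95, series_rates, {0}, 4000, rng))
```

Because the estimate is a binomial proportion, probabilities near 0 or 1 need
many replications for a useful relative error, which is the cost the text warns
about.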
4.5 APPROXIMATIONS TO THE HW/SW MODEL


Even for a simple series system, equation (4.1-1) is difficult to evaluate
without the use of a computer. For complex reliability constructs involving high
SW error content, computer costs may become excessive because of the precision
required in some of the calculations. Using the baseline model computer program
(see Appendix F) as a standard, two methods of approximating availability were
examined: 1) HW and SW availabilities computed separately and then combined
(i.e., $A(t) = A_{HW}(t) \cdot A_{SW}(t)$) and 2) HW and SW failure and repair rates lumped
(i.e., $\lambda = \lambda_{HW} + \lambda_{SW}$, with a single repair rate $\mu$ formed from $\mu_{HW}$ and
$\mu_{SW}$ [expression garbled in source]).

4.5.1 Series HW Constructs

For the first method, error is maximum at the minimum availability point
(i.e., where A(t) = A*). The error then diminishes slowly and finally approaches
zero as $t \to \infty$. The lumped parameter approximation is significantly simpler than
the first method of approximation only when the SW failure rate is considered
constant (i.e., the number of SW errors present remains fixed). In this case, the
error is maximum at steady-state.

Figure 4.5-1 compares these two methods with the "exact" calculation
provided by equation (4.1-1). A maximum error of 1.5% for the first method occurs
at eight hours. For the lumped parameter method, however, the maximum error
approaches 14.7% for large t. If SW is a significant factor in the system,
therefore, and it is necessary to make an approximation to the availability, then

$$A(t) = A_{SW}(t) \cdot A_{HW}(t) \tag{4.5-1}$$

is clearly a better choice. On the other hand, if SW is not significant, lumping
the parameters for HW and SW may be an adequate approximation and would be
easier to compute.
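The product approximation can be compared with a joint calculation on a small
series construct. The Python sketch below is an illustration under explicit
assumptions, not the report's baseline program: X(t) loses faults at rate c·j
independently of Y(t), the "exact" reference is the three-state series chain,
and the HW-only and SW-only availabilities come from two-state chains. All names
and the parameter values are hypothetical, so the error figures quoted in the
text should not be expected to reproduce.

```python
def up_probability(a_of_j, n_faults, c, t_end, dt=0.01):
    """P{Y(t_end) = 0} from Y(0) = 0, X(0) = n_faults, by RK4 on the joint chain.

    a_of_j(j) returns the matrix A_j as a list of rows.
    ASSUMPTION: X drops from j to j-1 at rate c*j independently of Y.
    """
    mats = [a_of_j(j) for j in range(n_faults + 1)]
    size = len(mats[0])

    def deriv(p):                                  # p[j][y] = P{X = j, Y = y}
        d = [[0.0] * size for _ in p]
        for j, row in enumerate(p):
            for x, px in enumerate(row):
                for y in range(size):
                    d[j][y] += px * mats[j][x][y]  # Y moves under A_j
                d[j][x] -= c * j * px              # a fault is corrected ...
                if j > 0:
                    d[j - 1][x] += c * j * px      # ... and X drops by one
        return d

    def shift(p, k, h):
        return [[a + h * b for a, b in zip(pr, kr)] for pr, kr in zip(p, k)]

    p = [[0.0] * size for _ in range(n_faults + 1)]
    p[n_faults][0] = 1.0
    for _ in range(int(round(t_end / dt))):
        k1 = deriv(p)
        k2 = deriv(shift(p, k1, dt / 2))
        k3 = deriv(shift(p, k2, dt / 2))
        k4 = deriv(shift(p, k3, dt))
        incr = [[(a + 2 * b + 2 * e + f) / 6 for a, b, e, f in zip(*rows)]
                for rows in zip(k1, k2, k3, k4)]
        p = shift(p, incr, dt)
    return sum(row[0] for row in p)

phi, lam_h, mu_s, mu_h, c, n_faults = 1.0, 0.004, 2.0, 2.0, 0.95, 5

def series(j):   # joint three-state series construct
    return [[-(phi * j + lam_h), phi * j, lam_h], [mu_s, -mu_s, 0.0],
            [mu_h, 0.0, -mu_h]]

def hw_only(j):  # HW alone, independent of j
    return [[-lam_h, lam_h], [mu_h, -mu_h]]

def sw_only(j):  # SW alone
    return [[-phi * j, phi * j], [mu_s, -mu_s]]

for t in (0.5, 2.0, 8.0):
    exact = up_probability(series, n_faults, c, t)
    approx = (up_probability(hw_only, n_faults, c, t)
              * up_probability(sw_only, n_faults, c, t))
    print(t, exact, approx, exact - approx)
```

Evaluating the difference at several times, including a point near the
availability minimum, shows where the product form is weakest.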

4.5.2  Redundant  HW  Constructs 

As in the series HW case, the approximation represented by (4.5-1) has a
maximum error at the minimum availability point and decreases for large t. This
error increases almost linearly as the number of HW units increases. Figure 4.5-2
illustrates a redundant system with two standby HW units. As the total number
of units increases to 20 (i.e., 18 of 20 units required for successful operation),
the error in the availability computation for the example shown increases to 40%.

This error is also a function of the HW and SW failure rates as Figure 4.5-3
indicates. In this comparison, a seven-state construct is used (i.e., 5 of 7 units
required) and the unit failure rates for HW and SW have both been reduced.
The maximum error in this example is only about 3%, which compares to 8% error
for the corresponding construct in the Figure 4.5-2 example.

Lumping the SW rates with HW rates for redundant HW would have essentially
the same effect as in the series system case. This method of approximation only
makes sense in situations where the SW can be considered redundant in the same
way that the corresponding HW units are redundant. However, it is not clear in
what sense the SW could be considered redundant.

Figure 4.5-1. Availability Approximations for Series Reliability Construct

Figure 4.5-2. Error Estimate Using $A(t) = A_{HW}(t) \cdot A_{SW}(t)$ Versus Number of HW Units
for a Two-Standby System (vertical axis: relative error in percent, up to 40)

Figure 4.5-3. Availability Approximation for a Two-Standby Redundant Construct


Section  5.0 


EXTENSION  TO  COMPLEX  RELIABILITY  CONSTRUCTS:  C3  SYSTEMS 


5.1  INTRODUCTION 

The  basic  HVV/SW  reliability  methodology  developed  in  Section  3  and  applied 
to  simple  reliability  constructs  in  Section  4  will  now  be  extended  to  more  com¬ 
plex  reliability  constructs.  Figure  5.1-la  illustrates  a  general  reliability  con¬ 
struct  consisting  of  a  series-parallel  configuration  of  HW  units.  Each  series 
"component"  in  the  configuration  shown  in  the  figure  is  a  construct  of  the 
type  already  discussed  in  Section  4.  An  equivalent  "series"  version  of  the 
general  construct  is  given  in  (b).  As  will  be  seen  later,  complex  system 
models  will  be  developed  around  this  equivalent  "series"  configuration. 

Because of the added complexity of including SW in the reliability measures,
more attention must be given to mission tasks and their HW/SW manifestations
than in previous reliability methodologies where only the HW portion of the
system was modeled. Moreover, for maintained systems, consideration must
also be given to how SW is maintained during system operation (i.e., the
immediate impact of SW failures on system restoration, which is analogous to
the HW MTTR) as well as the fault correction process which impacts the SW
failure rate. The term "mission" is used in a general sense to include both
extended or ongoing operations and short term or one-shot operations.
Depending on the purpose of the mission, therefore, the duration may be
measured in hours, days or months, or may be ongoing (continuous operation)
for the total life of the system. Accordingly, the selection of figures of merit
(FOM's) used to measure system reliability depends on the purpose and duration
of the mission. In the case of an airborne system, for example, the mission
duration is typically measured in hours and, therefore, the probability of
failure-free operation during the conduct of a mission is an appropriate FOM.
On the other hand, the FOM for a ground-based system is better represented by
the mean-time-to-failure (MTTF) or the availability.

The development of a system model, therefore, starts with 1) the statement
of the system reliability requirements defined by the appropriate
FOM's and 2) definition of mission tasks and the associated SW and HW
required for implementation. Detailed reliability constructs of the system are
then developed based on the definition of system/subsystem, alternate states of
successful operation, degraded operation, types of redundancy, and the repair
policy. The development of a combined HW/SW reliability model for a typical
command, control and communications (C3) system is illustrated in Section 5.2.

The combined HW/SW reliability theory developed in Section 3 allows
treatment of the failure/repair criteria in a "normal" manner. This treatment
is normal in the sense that unit HW failure and repair rates are independent
of the SW rates and conversely. Thus, a reliability engineer would not be
required to collect data and estimate the value of some "combined" HW/SW
metric in order to compute a combined HW/SW system reliability. However,
the reliability engineer will be required to select suitable SW models and
estimate certain SW parameters whereas he was previously concerned only
with the HW parameters. The procedures for determining unit failure rates
and repair rates are given in Sections 5.3 and 5.4, respectively.

5.2  DEVELOPMENT  OF  THE  SYSTEM  RELIABILITY  MODEL 

5.2.1  Mission  Tasks  and  Operating  Scenarios 

Mission tasks are defined based on a specified operational requirement
defined, for example, to meet a potential threat. For a system, the operational
requirement might be "air surveillance coverage over a specified region" and
typical mission tasks would be detection, identification, track, interceptor
control, etc. Figure 5.2-1 gives a simplified overview of the major HW and SW
elements contained in this type of system. The numbers in the boxes represent the
number of HW units.

Detailed HW and SW requirements are derived from the parameter requirements
and constraints of each mission task. In the case of HW, these requirements
result in defining the type of computer (i.e., processing capability,
memory size, etc), the number and kind of peripheral devices, and communications
equipment to external sensors. These HW units, however, do not by
themselves accomplish mission tasks. They are "driven" by human operators
or other stimuli through embedded SW.

For SW, there is a more direct relationship between mission tasks and
computer program components (or functions). The SW for C3 systems is
normally partitioned into functionally-oriented sets, computer program
configuration items (CPCI's). For example, an air surveillance system currently in
production at Hughes includes six sets as follows: operating system set (OSS),
applications set (APS), support set (SUS), system exercise set (SES), data
reduction set (DRS), and diagnostic set (DIS). The OSS includes such functions as
confidence checking and startover, and the APS contains the functions for conducting
the primary mission tasks, such as active correlation and flight plans. For an
air surveillance mission, the primary function of the APS is, therefore, to
establish the position of radar-reported targets and provide system operators
with the capability to classify them and direct their disposition. Figure 5.2-2
provides a typical breakdown of the air surveillance mission into mission
sub-tasks which relate directly to computer program functions. The processing
is distributed into a central computer and three mini-computers (programmable
peripheral controllers). Target detection data enters the system (via a
communications interface) from various sensors (e.g., remote radar sites). The
target height is determined by the height operator using the HEIGHT function
and the target is identified using the HEIGHT and FLIGHT PLAN functions. If the
target is determined to be hostile, the weapons director will dispatch the
appropriate interceptor for closer investigation using the INTERCEPT CONTROL
function.

Figure 5.2-1. Simplified Overview of Air Surveillance System. SW is executed via external
data, direct computer input, console operators and remote access terminals.


Some of these mission tasks are performed on a near continuous basis
(e.g., tracking air traffic) while other tasks (e.g., interceptor control)
are to be performed infrequently. The fact that some mission tasks are
exercised more frequently than others generally has very little effect on HW units
since they are not usually dedicated to single tasks and many tasks are being
conducted simultaneously. However, the effect on SW is direct and can be
substantial: if a mission task is not conducted the corresponding SW is not
executed and, therefore, cannot "fail." Therefore, for a given operating
scenario (i.e., mission task loading during peak operating hours, peacetime
conditions, battle conditions, etc), the SW functions that support the various
mission tasks are "duty cycled." As in the case of HW, this duty cycle should
be taken into account in allocating SW failure rates. This is discussed in more
detail in Section 5.3.2.

Figure 5.2-2. Air Surveillance Mission Sub-task Breakdown Into Computer Program Functions

5.2.2  System  Reliability  Model 

With reference to Figure 5.2-1, a reliability block diagram can be
constructed as shown in Figure 5.2-3. HW redundancy has been added for
purpose of illustration. The block diagram shows five "system components" which
relate to the reliability construct described in Figure 5.1-1: Component 1 is a
series construct with SW shown as a separate series block; Component 2 is all HW
and contains some unit redundancy; finally, Components 3 through 5 contain HW
redundancy which interacts with SW in some known way. The system SW shown in
the figure has been partitioned by the "accessing" HW units based on relative
utilization during a typical mission. Note that the programmable controllers do not
access SW even though they may contain memory modules, mag tape units,
etc. The controller SW is accessed only by the display consoles (e.g., a
human operator calls up a SW routine via the display switch action which
executes the program), remote terminals and, indirectly, by external stimulus
through the communications interface (e.g., automatic tracking updates to the
console operators). A certain amount of care must be exercised in partitioning
the SW among the accessing units within a component as well as between
components. Table 5.2-1 illustrates an example partitioning of the air
surveillance system represented in Figure 5.2-1. The numbers in the table
represent the percentage of SW utility by each HW unit based on, for example,
a system sizing and timing analysis. Thus, of the total time that the APS is
executed during the performance of a typical or average mission, communications
is responsible for 20%, the system control console for 5%, the central
computer for 10% and the display consoles for a total of 65% (note that the
table rows sum to 100% utility).
Figure 5.2-3. Reliability Block Diagram for a System


SW  partitioning  is  only  necessary  when  there  is  interaction  between  HW 
and  SW  such  that  a  HW  failure  changes  the  SW  failure  rate  (this  is  discussed 
in  more  detail  in  Section  5.3).  Otherwise,  the  total  SW  subsystem  can  be 
considered  as  a  separate  series  component  in  the  reliability  block  diagram. 

In  Section  5.3,  this  partitioning  is  used  to  compute  the  necessary  SW  failure 
rates  for  each  component  as  a  function  of  the  model  parameters  for  each  CPCI. 

The equations for computing various system level FOM's for series-parallel
configurations are given in Table 5.2-2. Most of these FOM's can be
recognized as standard HW reliability measures. Accordingly, they reduce to the
standard HW case when the SW is removed. Implicit in the formulation of
these FOM's is the assumption of operational independence between the
components (i.e., when the system is in a failed state undergoing repair
the remaining units of the system remain in an operating state). Even when
this assumption is incorrect, it provides a reasonable approximation for
modern repairable electronic systems since the time to repair is much shorter
than the time between system failures. Moreover, for large systems, a
complete shut-down of all equipment does not generally happen whenever a
system failure occurs. In any case, the equations given in the table provide
a close (conservative) approximation to the extent that system components
do not operate independently. For non-repairable systems, equations 5.2.2-1
and 5.2.2-3 provide exact solutions. The interpretations of equations 5.2.2-1
through 5.2.2-5 are the same as those found in HW reliability theory (e.g.,
Barlow and Proschan, 1965; Kozlov and Ushakov, 1970). Equations 5.2.2-6
and 5.2.2-7 are new and are useful in measuring the unique effects of SW
on system availability.

The individual terms in each equation represent the various components of
the system as defined previously. For components which do not contain any
SW, the computations are carried out using standard HW reliability formulae.
For components which contain SW, equations derived using a Markov
SW model are provided in Section 4. The approximations given for equations
5.2.2-1, 5.2.2-2 and 5.2.2-3 are good when the component repair times
are small relative to the component times to failure (i.e., $T_X \ll \theta_X$).
Two complications make the computation of the mean-time-between-failures (MTBF)
intractable: 1) the probability distribution laws of the system components
can be arbitrary (because of redundancy possibilities) and 2) the SW failure
rate changes in time. For SW-dominated systems, however, the mean-time-to-failure
($\theta$) is a conservative approximation to the MTBF because of the
improving SW.


TABLE 5.2-2. SYSTEM EQUATIONS FOR COMPUTING COMBINED
HW/SW RELIABILITY

System Figures of Merit (FOM's) and Definitions

(5.2.2-1)  $\theta \approx \int_0^\infty \prod_{X=1}^{K} P_X(t)\,dt$
           Definition: Mean time to failure (MTTF)
           where:
           K = Number of system components
           $P_X(t)$ = Probability of failure-free operation of Component X for t hours.

(5.2.2-2)  $T \approx \theta \sum_{X=1}^{K} (T_X/\theta_X)$
           Definition: Mean time to system restoration
           where:
           $T_X$ = Mean-time-to-repair Component X.

(5.2.2-3)  $P(t) = \prod_{X=1}^{K} P_X(t) \approx e^{-t/\theta}$
           Definition: Probability of failure-free operation for t hours.

(5.2.2-4)  $A(t) = \prod_{X=1}^{K} A_X(t)$
           Definition: Availability of the system at time t.

(5.2.2-5)  $A_{ss} = \lim_{t \to \infty} A(t)$
           Definition: Steady state availability of the system

(5.2.2-6)  $A^* = \min_t A(t)$;  $\prod_{X=1}^{K} \min_t A_X(t) \le A^* \le \min_X \min_t A_X(t)$
           Definition: Minimum availability (i.e., the minimum undershoot).
           See Figure 4.1-3.
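
As a rough illustration of how the series FOM's of Table 5.2-2 combine, the sketch below evaluates them for the special case in which every component has a constant failure rate $1/\theta_X$ and a constant repair rate $1/T_X$. That special case, the function names, and the use of $\theta_X/(\theta_X + T_X)$ for each component's steady-state availability are assumptions made for illustration only; components containing SW or redundancy require the Section 4 constructs instead.

```python
import math

def series_foms(thetas, t_repairs):
    """Series-system FOM's, assuming K independent constant-rate components.

    thetas    -- component mean times to failure (hours), one per component
    t_repairs -- component mean times to repair (hours), same order
    """
    system_rate = sum(1.0 / th for th in thetas)
    # Under the constant-rate assumption the integral of the product of the
    # P_X(t) reduces to the reciprocal of the summed failure rates.
    theta = 1.0 / system_rate
    # Mean time to system restoration: failure-rate-weighted repair time.
    t_restore = theta * sum(tr / th for th, tr in zip(thetas, t_repairs))
    # Steady-state availability as the product of component availabilities.
    a_ss = math.prod(th / (th + tr) for th, tr in zip(thetas, t_repairs))
    return theta, t_restore, a_ss

def prob_failure_free(t, thetas):
    """P(t) for the same constant-rate series system."""
    return math.exp(-t * sum(1.0 / th for th in thetas))

if __name__ == "__main__":
    # Hypothetical two-component system (values chosen only for the example).
    theta, t_restore, a_ss = series_foms([1000.0, 500.0], [1.0, 2.0])
    print(theta, t_restore, a_ss, prob_failure_free(8.0, [1000.0, 500.0]))
```

Because the repair times in the example are much shorter than the times to failure, this is the regime in which the text states the approximations are good.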
TABLE 5.2-2. SYSTEM EQUATIONS FOR COMPUTING COMBINED
HW/SW RELIABILITY (Continued)


5.3  FAILURE  DEFINITION 

5.3.1  Derivation  of  Transition  Rates  for  HW  Failure 

The failure rates of the units in Figure 5.2-3 (e.g., the computer, a
display console, controller A, etc) are assumed to be constant (i.e., an exponential
time to failure model is assumed). If they were not (i.e., contained redundancy),
then another level of detail in the system model would be required in order to
reach units of constant failure rate. With this assumption, therefore, these HW
failure rates are derived in the normal manner based on, for example, applying
MIL-HDBK-217 to part counts under specified electrical and environmental
stresses. When redundant units are utilized in the system (e.g., the operator
display consoles of component 5), the transition rates to the various states
(e.g., operating, degraded and failed) are functions of the hardware failure
rate and the current operating state of the system. Thus, if the failure rate
of a single operator display console (ODC) is $\lambda_{ODC}$, then the rate at which
component 5 transitions from 18 consoles to 17 consoles (one console down)
is $18\lambda_{ODC}$, and from 17 to 16 consoles (the failed state of component 5) is
$17\lambda_{ODC}$. In general, the rate of transition from state to state is dependent on
the number of units and the type of redundancy (i.e., whether the standby
units are fully operational, partially operational or non-operational). The HW
transition rate $\lambda_{HW}(i)$ from the ith state to the i+1 state is given below for
these three cases:

(5.3.1-1)  $\lambda_{HW}(i) = \begin{cases} (N_H - i)\,\lambda_u, & 0 \le i \le N_H - M, \text{ fully operational} \\ [M + (N_H - M - i)\,\nu]\,\lambda_u, & 0 \le i \le N_H - M, \text{ partially operational} \\ M\,\lambda_u, & 0 \le i \le N_H - M, \text{ non-operational} \end{cases}$

where $N_H$ is the total number of units available; M is the number of units required
for successful operation; $\lambda_u$ is the failure rate of a single unit; $\nu$ represents the
fraction of $\lambda_u$ at which the standby units are operating; and i is the number of HW
units that are down.

In equation (5.3.1-1), i = 0 represents the full-up state for the component
and i = $N_H - M + 1$ represents the failed state. The values for $\lambda_{HW}(i)$
provide the state-by-state failure transitions from full-up to failure used in the
reliability constructs of system components (see Section 4.0).
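
The three cases of equation (5.3.1-1) can be sketched as a single function. The function name, the `standby` keyword values and the numerical failure rate used below are hypothetical and chosen only for illustration; the console counts follow the component 5 example in the text.

```python
def hw_failure_rate(i, n_h, m, lam_u, standby="full", nu=1.0):
    """HW transition rate from state i to state i+1 (i = units down).

    n_h     -- total number of units available
    m       -- units required for successful operation
    lam_u   -- failure rate of a single unit
    standby -- "full", "partial" or "none" (illustrative labels)
    nu      -- fraction of lam_u at which partially operating standbys fail
    """
    if not 0 <= i <= n_h - m:
        raise ValueError("i must satisfy 0 <= i <= n_h - m")
    if standby == "full":
        return (n_h - i) * lam_u
    if standby == "partial":
        return (m + (n_h - m - i) * nu) * lam_u
    if standby == "none":
        return m * lam_u
    raise ValueError("standby must be 'full', 'partial' or 'none'")

if __name__ == "__main__":
    lam_odc = 1.0e-3  # hypothetical console failure rate per hour
    # 18 consoles available, 17 required: rates out of states i = 0 and i = 1.
    print(hw_failure_rate(0, 18, 17, lam_odc), hw_failure_rate(1, 18, 17, lam_odc))
```

Setting `nu=1.0` in the partial case reproduces the fully operational rate and `nu=0.0` reproduces the non-operational rate, which is a convenient consistency check on the three expressions.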

5.3.2  Derivation  of  SW  Transition  Rates  for  Failure 

The following position is taken on the meaning of a SW "failure":

1) Failure occurs as a result of a latent SW fault which manifests
itself by causing one or more units in the system to malfunction,
or by transmitting erroneous information to a user.

2) Failure occurrences are random in the sense that SW is executed
during the conduct of a mission in a random manner. Thus, the
external stimuli of an operational environment (e.g., target
identification, tracking information) are considered random phenomena,
and SW is exercised by these stimuli via the peripheral HW units.

SW fault causes stem from various error sources (i.e., logic, data
definition, data handling, interface, computational, etc). Fault manifestations also
have varying effects on mission performance, such as critical, major, or minor
(nuisance). Therefore, not all fault manifestations result in a system failure.
If information is collected on SW error data for purpose of estimating the SW
failure rate parameters (discussed below), only error data classified as
failure-causing (e.g., critical or major) should be used.

The SW failure rate model assumed for application to specific reliability
constructs in Section 4.0 is a Poisson type. Other SW models are, of course,
possible; selecting a different model changes only the parameters and does not
affect the methodology. A basic parameter of most of the current SW models
is the initial number of faults (or bugs), $N_0$. Using various estimating
techniques, the value of $N_0$ can be determined from historical error data. Attempts
have also been made to estimate $N_0$ based on a number of SW complexity metrics
(e.g., cyclomatic number, module fan-in and fan-out, syntactical constructs
and manpower) using descriptive types of information (Fitzsimmons, 1978;
Schneider, 1980; Winchester, 1978; et al). The baseline SW failure rate ($\lambda_{SW}$)
has the following form:

$\lambda_{SW}(j) = \phi (N_0 - j)$    (5.3.2-1)

where $N_0$ is defined above and $\phi$ is a constant of proportionality (representing the
failure rate of a single fault) also determined from error data. Thus, $\lambda_{SW}(j)$ is
proportional to the number of bugs remaining in the SW after j bugs have been
corrected. If we only consider the number of bugs remaining in the SW, say N,
then 5.3.2-1 can be written:

$\lambda_{SW}(N) = \phi N$.    (5.3.2-2)
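
A minimal sketch of the baseline SW failure rate of equation (5.3.2-1) follows. The function name and the parameter values are hypothetical; in practice $N_0$ and $\phi$ would be estimated from failure-causing error data as described above.

```python
def sw_failure_rate(j, n0, phi):
    """Baseline SW failure rate after j of the n0 initial faults are corrected."""
    if not 0 <= j <= n0:
        raise ValueError("j must satisfy 0 <= j <= n0")
    # The rate is proportional to the number of faults remaining.
    return phi * (n0 - j)

if __name__ == "__main__":
    n0, phi = 100, 0.002  # hypothetical: 100 initial faults, 0.002 failures/hour/fault
    for corrected in (0, 25, 50, 100):
        print(corrected, sw_failure_rate(corrected, n0, phi))
```

The rate falls linearly as faults are corrected and reaches zero when none remain, which is the "improving SW" behavior referred to in Section 5.2.2.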


The SW failure rate is partitioned according to how the various program
sets (e.g., APS, OSS and SUS) are exercised by the system HW units.
For example, using the SW partition illustrated in Table 5.2-1 the failure rate
of the operator display console component ($\lambda_{SW_5}$) is:

$\lambda_{SW_5} = 0.05\,N_{OSS}\,\phi_{OSS} + 0.2\,N_{SUS}\,\phi_{SUS} + 0.3\,N_{APS}\,\phi_{APS} + 0.15\,N_{DIS}\,\phi_{DIS} + 0.3\,N_{SES}\,\phi_{SES}$

where the $N_X$ and $\phi_X$ parameters are estimated from error data collected on the
individual CPCI's and the table values represent the fraction of the total
time each CPCI is exercised by the operator display console (Component 5).
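
The partitioned failure rate above is a utilization-weighted sum over the CPCI's, which can be sketched as follows. The utilization fractions are the ones quoted in the expression for the operator display console component; the fault counts and per-fault rates are hypothetical placeholders, and the function name is an assumption.

```python
# Fraction of each CPCI's execution time attributed to the operator display
# console component, as quoted in the expression above.
ODC_UTILIZATION = {"OSS": 0.05, "SUS": 0.2, "APS": 0.3, "DIS": 0.15, "SES": 0.3}

def component_sw_failure_rate(utilization, faults, phi):
    """Sum of (utilization fraction) * N_X * phi_X over the CPCI's of a component."""
    return sum(frac * faults[cpci] * phi[cpci] for cpci, frac in utilization.items())

if __name__ == "__main__":
    # Hypothetical remaining-fault counts and per-fault failure rates (per hour).
    faults = {"OSS": 20, "SUS": 10, "APS": 40, "DIS": 5, "SES": 8}
    phi = {"OSS": 1e-3, "SUS": 5e-4, "APS": 2e-3, "DIS": 5e-4, "SES": 1e-3}
    print(component_sw_failure_rate(ODC_UTILIZATION, faults, phi))
```

Keeping the utilization fractions in a separate table from the CPCI parameters mirrors the separation in the text between the sizing and timing analysis and the error-data estimates.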

In addition to the partitioning of SW across system components, the SW
"duty cycle" within a system component must be considered. For example,
if the error data on CPCI X used to estimate $N_X$ and $\phi_X$ is collected under
test conditions, then $\phi_X$ must be adjusted to represent field operating conditions.
Similarly, SW dependence on the state of the HW should also be considered.
This dependency is unique to each application and can be very complex. For
series components, the SW can always be modeled as an independent series
box as in Section 4.1. For system components which contain redundant HW
units, two extreme situations can be modeled which will provide bounds for
most cases: 1) failure of a HW unit within the component does not affect the
SW (i.e., SW loading remains constant) so that the SW could be treated as a
separate "unit" in series with the HW, and 2) failure of a HW unit reduces the
loading on the SW resulting in a proportionate decrease in the SW failure rate.*
Thus, if the values of the SW parameters N and $\phi$ are based on the loading
intensity of a single HW unit within a component containing $N_H$ units (of which
M are required for successful operation), then the SW transition rates for
the ith system state (i.e., i HW units are in repair) are given by:

(5.3.2-3)  $\lambda_{SW}(N,i) = \begin{cases} \phi_X N_X (N_H - i), & 0 \le i \le N_H - M, \text{ HW state dependent (operating standby)} \\ \phi_X N_X, & \text{HW state independent or a series system} \end{cases}$

where $N_H - M$ represents the total number of redundant standby units. HW
state dependent transition rates for partial standby units and non-operating
standby units are similarly applicable as defined in equation (5.3.1-1).

* (An example of this is a functionally redundant unit)

5.4 QUANTIFIABLE MAINTAINABILITY CONCEPTS

5.4.1 Derivation of Transition Rates for HW Repair

The time to repair distribution is assumed to be exponential. This
assumption is necessary for tractability. However, in cases where the
time-to-failure is much larger than the time to repair (typical of modern repairable
electronic systems), this assumption approximates the case where the
time-to-failure distribution is exponential and the time to repair distribution is
arbitrary (Kozlov and Ushakov, 1970).

The expected repair time of individual HW units is generally based on prediction
data, built-in test capability and maintainability concepts unique to the HW usage.
The HW transition rate from a state of i HW units down to the state with i-1 HW
units down is dependent on the number of units in repair and the number of
repairmen servicing these units. Bounds are obtained by the two extremes:

(5.4.1-1)  $\mu_{HW}(i) = \begin{cases} i\,\mu_{uHW}, & 1 \le i \le N_H - M, \text{ unlimited repairmen} \\ \mu_{uHW}, & 1 \le i \le N_H - M, \text{ single repairman} \end{cases}$

where $\mu_{uHW}$ is the estimated unit repair rate and $N_H - M$ represents, as before,
the number of standby units. Thus, in the unlimited repairmen case every
failed unit is assigned a repairman working at a rate of $\mu_{uHW}$ and, in the
single repairman case, the rate is constant regardless of the number of units
failed. Most of the actual repair situations for systems will fall within
these two extremes.
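
The per-state rates for a redundant component, namely the SW failure rate of equation (5.3.2-3) and the HW repair rate of equation (5.4.1-1), can be sketched as below. The function names are hypothetical. The intermediate case of a finite number of repairmen, modeled here as min(i, repairmen) times the unit repair rate, is an assumption added for illustration; the text gives only the unlimited and single repairman bounds.

```python
def sw_state_failure_rate(i, n_faults, phi, n_h, m, hw_dependent=True):
    """SW failure rate with i HW units in repair.

    n_faults and phi are referenced to the loading of a single HW unit.
    """
    if not 0 <= i <= n_h - m:
        raise ValueError("i must satisfy 0 <= i <= n_h - m")
    if hw_dependent:
        # Loading, and hence the SW failure rate, scales with operating units.
        return phi * n_faults * (n_h - i)
    # HW state independent (or series) case: the rate does not depend on i.
    return phi * n_faults

def hw_repair_rate(i, mu_u, repairmen=None):
    """HW transition rate from i units down to i-1 units down.

    repairmen=None -- unlimited repairmen (upper bound)
    repairmen=1    -- single repairman (lower bound)
    other values   -- assumed intermediate case, min(i, repairmen) * mu_u
    """
    if i < 1:
        raise ValueError("at least one unit must be down")
    busy = i if repairmen is None else min(i, repairmen)
    return busy * mu_u

if __name__ == "__main__":
    mu_u = 0.5  # hypothetical unit repair rate per hour (2 hour mean repair time)
    print(hw_repair_rate(3, mu_u), hw_repair_rate(3, mu_u, repairmen=1))
    print(sw_state_failure_rate(1, 10, 0.01, 18, 17))
```

Evaluating both bounds for the same state gives the range within which, according to the text, most actual repair situations will fall.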

5.4.2  Derivation  of  Transition  Rates  for  SW  Repair 

5.4.2.1 Introduction

Although there are similarities between SW and hardware maintainability
(such as fault isolation), the term maintainability as applied to SW must be
used carefully. A synonym, repairability, is the ability to restore something
to its state before failure. In a strict sense, SW is not repaired, because
when SW fails it is corrected. SW maintainability entails the ability to change
a computer program and/or data to a new state, and does not entail the
ability to restore to a previous state. Therefore, when the term "repair" is
used in connection with SW, it should always be taken in this latter context.

Of  the  three  classes  of  SW  maintenance:  1)  corrective,  2)  adaptive,  and 
3)  perfective,  only  corrective  is  applicable  to  the  quantitative  assessment  of 
a  software  failure's  contribution  to  system  availability  and  operational  readiness. 
The  other  two  classes  can  be  considered  applicable  only  in  the  sense  that  such 
maintenance  activities  can  introduce  faults  which  produce  failures  that  in 
turn  require  corrective  maintenance. 

5.4.2.2 General SW Maintainability Considerations

The scope of this task was to investigate and develop SW maintainability
concepts that would in turn lead to the development of a combined HW/SW
system availability measure. The investigation started with a literature search.
Several proposed hierarchies of SW maintainability attributes were found.
However, care must be taken in determining those attributes that directly
influence SW maintainability. As Feuer and Fowlkes (1979) point out:
"identifying program characteristics that may coincidently vary with a
fundamental property tells us little about the property." The approach to selecting
quantifiable SW maintainability attributes was to use the McCall and Matsumoto
(1980) hierarchy (see Figure 5.4.2-2) as a starting point. The SW
maintainability criteria were scrutinized for associated metrics that were coercible.

Some attributes, such as Quantity of Comments, were eliminated. Rather than
developing a final exhaustive set of SW maintainability attributes and metrics,
a set was compiled that includes attributes directly representative of SW
maintainability characteristics, and associated criteria that are easy to measure
and convert into SW transition rates for correction.

Some (such as Yau, 1980) contend that SW maintainability cannot be
predicted, but that it can be measured as the SW proceeds through the
development phases. Yau views maintainability as including the resistance to both
logical and performance ripple effects, and has developed an algorithm for
measuring the resistance (or stability) based on the number of logical and
performance attributes changed. Curtis, et al. (1979), report on experiments
where metrics applied during the coding phase did predict the difficulty in
performing maintenance activities.

During the investigation of SW maintainability concepts, Hughes has
concentrated on those activities that have associated quantifiable factors. The
factors of interest are those that can be included in system metrics (e.g.,
mean-time-to-repair) or contribute to system-level figures of merit for design
attribute tradeoffs.

5.4.2.3 Consideration of SW Maintainability Techniques in Modeling System
Restoration

In developing metrics that model restoration times during mission operation,
one must consider all known SW maintenance techniques. Certain quick-fix
techniques such as patching, although not condoned as a good configuration
control practice, are certainly effective in restoring a critical C3 system
function during operation. Table 5.4.2-1 summarizes common software
maintenance techniques, and indicates the general effect on system availability and
operational capability.

Startover and switchover are common techniques for attempting to restore
a system, especially in environments where maintenance facilities are not
collocated. The risk with these techniques, although they cause minimal downtime
as listed in Table 5.4.2-1, is that the software fault may still remain in a
subtle form or that the restart does not completely resynchronize the system
or restore critical data. A more effective method for startover is to collect
data that will support troubleshooting prior to restarting. This method
increases the probability of correcting a fault that otherwise may recur during
a critical mission.

If startover and switchover capabilities are not provided or are unsuccessful,
then a reload and initialization of the software subsystem is usually performed.
It is at this point, before the reload, that extra time should be taken to
execute selected on-line fault isolation and generate memory printouts for
off-line troubleshooting. A typical reload can be completed in four minutes
while an initialization can take five to ten minutes depending on the interface
configuration.
TABLE 5.4.2-1. IMPACT OF SOFTWARE MAINTAINABILITY TECHNIQUES
ON SYSTEM RESTORATION

Restoration Technique                    System Availability                        Operational Capability

• Automatic Startover                    Downtime (<1 min.)                         No degradation

• Automatic Switchover                   Downtime (<2 min.)                         No degradation

• Semiautomatic Startover                Downtime (<3 min.)                         Suspect if no safe data

• Manual Reload                          Reload downtime (<5 min.)                  No degradation

• Fallback to previous version/release   Reload downtime (<5 min.)                  Lacks latest revisions

• Patch                                  Unavailable during problem analysis        No degradation
                                         and patch implementation and test
                                         (<13 hr.)

• Selective Recompilation/Reassembly     Possible unavailability during problem     No degradation
                                         analysis, source correction,
                                         recompilation, test and reload
                                         (<15 hr.)

• Complete Recompilation/Reassembly      Unavailable during problem analysis,       No degradation
                                         source correction, recompilation,
                                         system generation, test, and reload
                                         (<17 hr.)

• Load new application program           Reload downtime (<5 min.)                  No degradation
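
One way the downtime bounds of Table 5.4.2-1 might feed a restoration transition rate is sketched below. Treating each upper bound as if it were the mean restoration time, and taking the rate as its reciprocal, is an assumption made here for illustration; it yields a conservative (lowest) rate for each technique rather than an estimate derived from field data. The dictionary keys and function name are hypothetical.

```python
# Upper bounds on downtime from Table 5.4.2-1, converted to hours.
DOWNTIME_BOUND_HOURS = {
    "automatic startover": 1.0 / 60.0,
    "automatic switchover": 2.0 / 60.0,
    "semiautomatic startover": 3.0 / 60.0,
    "manual reload": 5.0 / 60.0,
    "fallback to previous version": 5.0 / 60.0,
    "patch": 13.0,
    "selective recompilation": 15.0,
    "complete recompilation": 17.0,
    "load new application program": 5.0 / 60.0,
}

def min_restoration_rate(technique):
    """Lowest restoration rate (per hour) consistent with the table's bound.

    Assumes the bound is used in place of the mean time to restore.
    """
    return 1.0 / DOWNTIME_BOUND_HOURS[technique]

if __name__ == "__main__":
    for name in ("automatic startover", "manual reload", "patch"):
        print(name, min_restoration_rate(name))
```

The spread between the startover and recompilation entries illustrates why the choice of restoration technique dominates the SW contribution to system downtime.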

Patching is the correction of a program in machine code, regardless of the
language in which the program was originally coded. In most systems patches may
be entered into the system through the computer console. After a patch is
implemented it may be tested online by using the patching technique to vary input data
and to force certain path executions. A successful simple patch can be implemented
in approximately one hour (see Table 5.4.2-4).

Recompilation, reassembly, and link-loading often occur on a "host"
computer other than the one on which the program is to be run. This is done to
take advantage of support facilities which large computing systems possess and
small, typically real-time, computers do not. Large computing systems often
have prohibitive turnaround time; however, most systems provide for
dedicated support computers which are collocated with the operational
configuration, in order to reduce the recompilation turnaround time. A
recompilation correction on an unsupported computer could cause downtime from two and a
half to seventeen hours depending on the complexity of the fault.

For completeness, the related topics of configuration management and
interactive versus batch processing should be mentioned since these areas impact
the SW fault correction process. Although patching is the recommended
technique for a quick fix in an operational environment, it can cause software
configuration control problems, and thus slow down fault corrections during
subsequent source-level maintenance, because the octal patch form of the
correction usually does not directly correspond to the source form. Octal is a
commonly used three-bit pattern that is convenient for machines with a basic
word size that is divisible by three. Examples of problem causes are failure
to document the correction in patch form and the introduction of faults during
the conversion of patch to source corrections. Interactive maintenance is
becoming increasingly popular during SW development. For example,
the Source Code Control System of the Programmer's Workbench provides
responsive change and selective recompile capabilities under a protective
mechanism against unauthorized users. Such systems, although designed
for timesharing SW development, could be adapted to C3 operations by
providing an emergency dedicated mode that could be invoked from a remote site.
The resulting computer program could be automatically transmitted to the site
after being tested at the support location via the remote terminal. Such a
technique with its attendant reliability and configuration management might be
competitive in responsiveness with the on-site patching technique.

5.4.2.4 Selection of Quantifiable Attributes

A prerequisite for measuring SW maintainability is the determination of those
attributes that characterize maintainability. A search of the literature
revealed several candidate sets of attributes. Boehm et al. (1973) decomposed
maintainability into three attributes and nine subattributes, as shown in
Figure 5.4.2-1.

McCall and Matsumoto (1980) decompose maintainability into five criteria, as
shown in Figure 5.4.2-2, and then into eleven metrics. Yau proposes stability
as the most critical SW maintainability factor, where stability is further
decomposed into functional and performance subfactors. Most of the candidate
SW maintainability factors can be further defined in terms of quantifiable
subfactors that can be either measured or estimated during early phases of SW
development.


Figure 5.4.2-2. McCall-Matsumoto SW Maintainability Tree


Hughes suggests the decomposition of SW maintainability into three attributes,
as shown in Figure 5.4.2-3, that are representative of the time-consuming
activities of problem analysis and resolution. Self-descriptiveness is an
attribute of software that characterizes the extent to which the source code
(including comments and prologue) of a computer program contains enough
information for a software engineer to maintain it. Complexity is an attribute
of software that characterizes the degree of difficulty a software engineer
may encounter (in terms of syntactical structures, control and data flow,
module interconnections, and entropy) when tasked to modify an existing
computer program. Modularity is an attribute of software that characterizes
the extent to which a portion (module) of a computer program can be modified
without affecting other portions of the program. The associated factors are
listed in Table 5.4.2-3. They were tailored from the GE/RADC set (McCall,
1980, Vol. II) based on later findings by GE/RADC (McCall, 1980, Vol. I) and
on the emphasis placed on the complexity attribute by Hughes. The tailoring
scheme is explained in Table 5.4.2-2.


TABLE 5.4.2-2. TAILORING SCHEME FOR MAINTAINABILITY FACTORS

GE/RADC Factors*                             Action     Justification
-------------------------------------------  ---------  ------------------------------
Consistency Checklists (CS.1.2)**            Deleted    Not predictive
Design Structure Measure (SI.1)              Retained   Designated CO.1
Structure Language Check (SI.2)              Deleted    Not generally applicable***
Complexity Measure (SI.3)                    Modified   Redefined and designated CO.2
Coding Simplicity Measure (SI.4)             Retained   Designated CO.3
Stability Measure (MO.1)                     Deleted    Too difficult to measure***
Modular Implementation Measure (MO.2)        Retained   --
Quantity of Comments (SD.1)                  Deleted    Not sensitive enough
Effectiveness of Comments Measure (SD.2)     Retained   --
Descriptiveness of Implementation Language
  Measure (SD.3)                             Retained   --
Conciseness Measure (CO.1)                   Deleted    Too difficult to calculate

*McCall (1980, Vol. II)
**Abbreviation references in parentheses relate to definitions in McCall (1980, Vol. I)
***McCall (1980, Vol. I)


TABLE 5.4.2-3. SUGGESTED MAINTAINABILITY FACTORS

•  Design Structure (CO.1)

•  Use of Structured Language (CO.2)

•  Data and Control Flow Complexity (CO.3)*

•  Logical Complexity (CO.4)*

•  Effectiveness of Comments (SD.2)

•  Descriptiveness of Implementation Language (SD.3)

•  Modular Implementation (MO.2)

*Measured at intramodule level (see Table 5.4.2-7).

5.4.2.5 System Compatible SW Maintainability Metrics

The system downtime resulting from a SW failure is dependent on the
restoration technique called for (see Table 5.4.2-1) and the maintainability
characteristics of the SW. If the system can be restored by startover or
switchover techniques, the system downtime is not significant (usually less
than 5 minutes). If, however, the SW requires some design correction (i.e.,
a SW patch or recompile/reassemble) for system restoration, the system
downtime can be very significant. The amount of downtime in this case is
greatly dependent on the maintainability characteristics of the SW.

The mean system restoration time, $\tau_{SW}$, due to a SW failure is defined by:

$\tau_{SW} = a_1 \tau'_{SW} + a_2 \tau''_{SW} + a_3 \tau'''_{SW}$          (5.4.2-1)

where:

$a_1, a_2, a_3$ = relative frequencies of occurrence of startover/switchover,
patch, and recompile/reassemble, respectively.

$\tau'_{SW}$ = average time for startover/switchover.

$\tau''_{SW}$ = average time for patch.

$\tau'''_{SW}$ = average time for recompile/reassemble.

Table 5.4.2-4 provides some typical values for $\tau'_{SW}$ applicable to C3
systems. These recovery times are based primarily on the system's operating
configuration and are independent of the maintainability characteristics of
the SW.

Values for $\tau''_{SW}$ and $\tau'''_{SW}$ are based on characteristics of SW
maintainability. A promising approach to developing a quantitative, predictive
SW maintainability measure is derived from the suggested maintainability
factors of Table 5.4.2-3. These maintainability factors are aggregated at the
CPCI or functional level. Then the resulting CPCI values for the entire SW
subsystem are averaged (weighted by probability of execution) to arrive at a
system level SW maintainability figure of merit (FOM). This FOM is then
translated to a SW correction time, $\tau''_{SW}$ or $\tau'''_{SW}$, according
to the restoration technique employed.

The SW maintainability FOM, $\delta_M$, for an aggregate of n CPCIs is defined
by:

$\delta_M = \sum_{K=1}^{n} F_K \, p(E_K)$          (5.4.2-2)

where $F_K$ is the combined maintainability measure for the $K$th CPCI and
$p(E_K)$ is the relative frequency of CPCI execution during a typical or
average mission scenario. The maintainability measure is averaged across CPCI
and module level factors as follows:

                                                   (5.4.2-3)

where:

$F_{M_{jK}} = \frac{1}{2}(CO_2 + CO_3)$

$F_{C_K} = \frac{1}{3}(SD_3 + CO_1 + MO_2)$

=  Number of modules in the $K$th CPCI

$CO_1$, $CO_2$, $CO_3$, $SD_3$, and $MO_2$ = Maintainability factors unique to
the $j$th module in the $K$th CPCI (see Table 5.4.2-3)
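The weighted aggregation in (5.4.2-2) and the two sub-measures defined above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the factor scores and execution frequencies are invented for illustration, all scores are assumed to be normalized to [0, 1], the execution frequencies are assumed to sum to 1, and the per-CPCI values F_K are supplied directly rather than derived from (5.4.2-3).

```python
def module_measure(co2, co3):
    """Module-level measure: the mean of the CO2 and CO3 factor scores."""
    return (co2 + co3) / 2.0


def cpci_measure(sd3, co1, mo2):
    """CPCI-level measure: the mean of the SD3, CO1 and MO2 factor scores."""
    return (sd3 + co1 + mo2) / 3.0


def figure_of_merit(f_values, exec_freqs):
    """Equation (5.4.2-2): sum over CPCIs of F_K times p(E_K)."""
    if len(f_values) != len(exec_freqs):
        raise ValueError("one execution frequency is needed per CPCI")
    return sum(f * p for f, p in zip(f_values, exec_freqs))


# Hypothetical values for three CPCIs (not taken from the report).
f_values = [0.9, 0.7, 0.8]     # assumed combined measures F_K
exec_freqs = [0.5, 0.3, 0.2]   # assumed relative execution frequencies p(E_K)
delta_m = figure_of_merit(f_values, exec_freqs)
```

With these invented inputs the frequently executed CPCI dominates the result, which is the intent of weighting by probability of execution.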


Finally, the FOM given in (5.4.2-2) can be converted to SW correction times,
$\tau''_{SW}$ and $\tau'''_{SW}$, by scaling values of $\delta_M$ to individual
SW activities according to the correction method based on user experience.
Tables 5.4.2-5 and 5.4.2-6 provide ranges of values for these activities based
on actual SW maintenance experience on C3 projects developed at Hughes. The
term "regression testing" listed in the tables is an activity common to both
methods of SW correction; it is concerned with assuring that satisfactory SW
performance already attained and tested is not perturbed by implementation of
the correction. The following relationships convert $\delta_M$ values to SW
correction time based on the correction method:


$\tau''_{SW} = 14.5 - 13\,\delta_M$     (for patch correction)          (5.4.2-4)

$\tau'''_{SW} = 19 - 16.6\,\delta_M$    (for recompile/reassemble)      (5.4.2-5)

The relative frequencies $a_1$, $a_2$, and $a_3$ in (5.4.2-1) are determined
from the system operating configuration and user experience. Typically, for
systems developed at Hughes, these values are 0.6, 0.3, and 0.1, respectively.
The corresponding transition rate for system restoration due to a SW failure
is defined in terms of $\tau_{SW}$:


$\mu_{SW} = 1 / \tau_{SW}$          (5.4.2-6)


It must be emphasized that the relationships of (5.4.2-4) and (5.4.2-5) were
derived from Hughes experience and may be different for each SW contractor.
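As a worked illustration of relationships (5.4.2-1) and (5.4.2-4) through (5.4.2-6), the following Python sketch chains them together. The figure of merit (0.8) and the startover/switchover time (3 minutes) are hypothetical inputs chosen only for the example. The sketch also assumes that all times are expressed in hours, so the recovery time quoted in minutes is converted before use; that unit convention is an assumption of the sketch.

```python
def patch_time_hr(delta_m):
    """Relationship (5.4.2-4): patch correction time for a given FOM."""
    return 14.5 - 13.0 * delta_m


def recompile_time_hr(delta_m):
    """Relationship (5.4.2-5): recompile/reassemble correction time."""
    return 19.0 - 16.6 * delta_m


def mean_restoration_time_hr(delta_m, restart_hr, freqs=(0.6, 0.3, 0.1)):
    """Relationship (5.4.2-1) with the typical relative frequencies."""
    a1, a2, a3 = freqs
    return (a1 * restart_hr
            + a2 * patch_time_hr(delta_m)
            + a3 * recompile_time_hr(delta_m))


def restoration_rate_per_hr(tau_sw_hr):
    """Relationship (5.4.2-6): rate is the reciprocal of the mean time."""
    return 1.0 / tau_sw_hr


delta_m = 0.8            # hypothetical figure of merit
restart_hr = 3.0 / 60.0  # hypothetical 3-minute startover, converted to hours
tau_sw = mean_restoration_time_hr(delta_m, restart_hr)
mu_sw = restoration_rate_per_hr(tau_sw)
```

Under these assumed inputs the patch and recompile terms contribute almost all of the mean restoration time, even though startover/switchover is the most frequent restoration technique.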

It should also be noted that the method of combining module-level metric
values with CPCI-level metric values deserves further study. Some researchers
propose a percentile form rather than the normalized form used here (refer to
Table 5.4.2-7), while others propose multiplying module-level values rather
than summing. It was not within the scope of this study to investigate the
accuracy of alternate methods of combining metrics. Consequently, the
straightforward methods of normalization and summation were selected.


TABLE 5.4.2-4. TYPICAL C3 SYSTEM RECOVERY TIME AFTER SOFTWARE FAILURE

Technique                                      Recovery Time Range (min)
---------------------------------------------  -------------------------
Automatic startover                            0.5 - 1.0
Automatic switchover                           1.0 - 2.0
Semiautomatic startover                        2.0 - 3.0
Manual reload                                  3.0 - 5.0
Reinitialization (including coarse and
  fine synchronization)                        5.0 - 10.0
Average                                        2.3 - 4.2

TABLE 5.4.2-5. TYPICAL C3 SYSTEM SW CORRECTION TIME FOR PATCH METHOD

Activity                                         Correction Time Range (hr)
-----------------------------------------------  --------------------------
Locate software fault                            0.5  - 8.0
Design correction                                0.25 - 2.0
Implementation correction in machine
representation form
  •  Find unused storage area                    0.08 - 0.25
  •  Select instruction to patch                 0.08 - 0.25
  •  Implement patch                             0.08 - 0.5
Test correction                                  0.25 - 1.0
Regression test                                  0.25 - 1.0
Total                                            1.5  - 13.0


TABLE 5.4.2-6. TYPICAL C3 SYSTEM SW CORRECTION TIME FOR RECOMPILE/REASSEMBLY
METHOD

Activity                                         Correction Time Range (hr)
-----------------------------------------------  --------------------------
Locate software fault                            0.5  - 8.0
Design correction                                0.5  - 4.0
Implement correction in source form              0.5  - 2.0
Recompile/Reassemble                             0.1  - 0.5
Test correction                                  0.25 - 0.5
Integrate into new version                       0.25 - 1.0
Regression test                                  0.25 - 1.0
Total                                            2.4  - 17.0
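The totals in Tables 5.4.2-5 and 5.4.2-6 are simply the sums of the per-activity lower and upper bounds. A minimal Python sketch of that arithmetic follows; the dictionary keys are shorthand labels introduced here for the table rows, not names used in the report.

```python
# Per-activity (low, high) correction times in hours, from Table 5.4.2-5.
patch_ranges = {
    "locate fault": (0.5, 8.0),
    "design correction": (0.25, 2.0),
    "find unused storage": (0.08, 0.25),
    "select instruction": (0.08, 0.25),
    "implement patch": (0.08, 0.5),
    "test correction": (0.25, 1.0),
    "regression test": (0.25, 1.0),
}

# Per-activity (low, high) correction times in hours, from Table 5.4.2-6.
recompile_ranges = {
    "locate fault": (0.5, 8.0),
    "design correction": (0.5, 4.0),
    "implement in source": (0.5, 2.0),
    "recompile/reassemble": (0.1, 0.5),
    "test correction": (0.25, 0.5),
    "integrate new version": (0.25, 1.0),
    "regression test": (0.25, 1.0),
}


def total_range(ranges):
    """Sum the lower bounds and the upper bounds over all activities."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


patch_total = total_range(patch_ranges)
recompile_total = total_range(recompile_ranges)
```

The summed lower bounds come to about 1.49 and 2.35 hours, which agree with the tabulated 1.5 and 2.4 after rounding to one decimal place; the summed upper bounds are 13.0 and 17.0 hours.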


TABLE 5.4.2-7. DESCRIPTION OF SW MAINTAINABILITY METRICS

Metric                                                       Applicable Phase
                                                             (Design, Coding)

SD.2  EFFECTIVENESS OF COMMENTS

(1)   Modules have standard formatted prologue                      X
      1 - (# modules that violate rule / total # modules)
(2)   Comments set off from code in uniform manner                  X
      1 - (# modules that violate rule / total # modules)
(3)   All transfers of control and destinations commented           X
      1 - (# modules that violate rule / total # modules)
(4)   All machine dependent code commented                          X
      1 - (# modules that violate rule / total # modules)
(5)   All non-standard HOL statements commented                     X
      1 - (# modules that violate rule / total # modules)
(6)   Attributes of all declared variables commented                X
      1 - (# modules that violate rule / total # modules)
(7)   Comments do not just repeat operation described               X
      in language
      1 - (# modules that violate rule / total # modules)

SD.3  DESCRIPTIVENESS OF IMPLEMENTATION LANGUAGE

(1)   High order language used                                      X
      1 - (# modules with direct code / total # modules)
(2)   Variable names (mnemonic) descriptive of physical             X
      or functional property represented
      1 - (# modules that violate rule / total # modules)
(3)   Source code logically blocked and indented                    X
      1 - (# modules that violate rule / total # modules)
(4)   One statement per line                                        X
      1 - (# continuations + multiple statement lines) / total # lines

CO.1  DESIGN STRUCTURE

(1)   Design organized in top down fashion                          X  X
(2)   Module processing not dependent on prior processing           X
(3)   Modules have single entrance, single exit                     X
      (1 / # entrances, 1 / # exits)
(4)   Compartmentalization of data base                             X
      (size / # files)

CO.2  DATA AND CONTROL FLOW COMPLEXITY

(1)   Module Size Profile                                           X
(2)   Cyclomatic Number                                             X  X
(3)   Variable Liveness                                             X
      (# live variables / # possible live variables)
(4)   Logical Stability (Yau 1980)                                  X
(5)   Module flow top to bottom                                     X

CO.3  LOGICAL COMPLEXITY

(1)   Negative Boolean or complicated compound Boolean              X
      expressions used
      (1 - # of above / # executable statements)
(2)   Jumps in and out of loops                                     X
      (# single entry/single exit loops / total # loops)
(3)   Loop index modified                                           X
      (# loop indices modified / total # loops)
(4)   Module is not self modifying                                  X
      (# constructs / total lines of code)
(5)   Number of statement labels
      (1 - # labels / # executable statements)
(6)   Nesting level
      (1 / max nesting level)
(7)   Number of branches
      (1 - # branches / # executable statements)
(8)   Number of GOTOs
      (1 - # GOTO statements / # executable statements)
(9)   Variable mix in a module
      (# internal variables / total # variables)
(10)  Variable density
      (1 - # variables / # executable statements)

MO.2  MODULAR IMPLEMENTATION

(1)   Hierarchical structure
      (1 - # violations of hierarchy / total # modules)
(2)   Controlling parameters defined by calling module
      (# control variables / # calling parameters)
(3)   Input data controlled by calling module
(4)   Output data provided to calling module
(5)   Control returned to calling module
(6)   Modules do not share temporary storage


Section  6.0 


RELIABILITY  TRADEOFF  METHODOLOGY 


6.1  GENERAL  APPROACH 

The general HW/SW reliability tradeoff problem can be described as the
selection of the appropriate mix of hardware and software complexity to
achieve desired reliability characteristics. Such tradeoff studies are, of
course, restricted to mission tasks/functions which can be accomplished
using either HW or SW or mixtures of both (for example, graphic display
generation functions and peripheral controllers).

Other tradeoff studies are possible. For example, increasing BIT capability
by increasing SW complexity will reduce MTTR and hence increase availability,
thus allowing the possibility of reducing HW complexity (e.g., eliminating
some redundancy) while still maintaining adequate availability. This is not a
direct tradeoff of HW and SW complexity, but rather a tradeoff between MTBF
and MTTR. Such tradeoffs have been analyzed before, but not from the
standpoint of a model which adequately combines both HW and SW failures. The
primary interest in this section, however, is in the tradeoff between HW and
SW complexity for the purpose of performing specific mission tasks.

To  perform  a  given  tradeoff  analysis  it  is  necessary  to  fix  the  basic 
system  configuration  (e.g.  with  respect  to  HW  redundancy)  as  it  applies 
to  mission  tasks  not  related  to  the  tradeoff  task.  Also,  as  seen  earlier, 
steady  state  availability  is  not  an  adequate  figure  of  merit  for  evaluating 
performance  of  systems  possessing  both  HW  and  SW  failures  because  of  the 
possible  undershoot  of  availability  below  steady  state.  A  better  measure  is 
the  minimum  availability 

$A^* = \min_{t \ge 0} A(t)$          (6.1-1)

where $A(t)$ is the system availability as a function of time (cf. 3.3-1). An
alternate measure which can be used if a particular mission length $T$ is
important is the average availability over the mission length, defined by

$A_T = T^{-1} \int_0^T A(t)\,dt.$          (6.1-2)

If $T$ is large, however, the average availability measure will simply reduce
to the steady state availability. With all other quantities fixed, the
tradeoff can then be performed by computing $A^*$ or $A_T$ for varying
combinations of HW and SW or, if there is a fixed availability requirement, an
isometric curve relating HW complexity to SW complexity to achieve the fixed
requirement can be generated.
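The two measures (6.1-1) and (6.1-2) can be approximated numerically once A(t) has been sampled. The Python sketch below assumes a short, invented list of samples with an undershoot; in practice the samples would come from exercising the model. The minimum is taken over the samples, and the integral is approximated with the trapezoidal rule, which is a choice made for this sketch.

```python
def min_availability(avail):
    """Approximate A* in (6.1-1) by the smallest sampled availability."""
    return min(avail)


def average_availability(times, avail):
    """Approximate A_T in (6.1-2) with the trapezoidal rule.

    `times` must be increasing; the mission length T is times[-1] - times[0].
    """
    if len(times) != len(avail) or len(times) < 2:
        raise ValueError("need at least two matching samples")
    area = 0.0
    for i in range(len(times) - 1):
        width = times[i + 1] - times[i]
        area += 0.5 * (avail[i] + avail[i + 1]) * width
    return area / (times[-1] - times[0])


# Hypothetical availability samples showing an undershoot below 0.97.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
avail = [1.00, 0.95, 0.96, 0.97, 0.97]

a_star = min_availability(avail)
a_avg = average_availability(times, avail)
```

For these invented samples the minimum availability (0.95) is lower than both the mission average and the final value (0.97), which is the situation that motivates using $A^*$ rather than a steady-state figure.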

6.2  HW  COMPLEXITY 

HW complexity is defined in terms of total HW failure rate, usually derived
from piecepart counts along with generic classification and use of, for
example, MIL-HDBK-217. To perform a tradeoff analysis, it is necessary to fix
all HW failure rates which are not involved in the mission task for which HW
and SW are being traded. For further discussion on the development of HW
failure rates, see Section 5.3.1.

6.3  SW  COMPLEXITY 

For the baseline model the SW failure rate is directly proportional to the
number of faults remaining in the SW. Since the number of faults is changing
randomly in time, it does not make sense to use SW failure rate in the same
sense as HW failure rate. However, assuming the constant of proportionality
$\phi$ to be fixed, the SW failure rate when the system is delivered is
$\phi N$, where $N$ is the number of faults in the SW upon delivery. Thus, HW
failure rate can be traded against the initial SW failure rate $\phi N$. It
must be emphasized here that $\phi$ is fixed and SW failure rate decreases in
units of $\phi$ as faults are corrected.

The constant $\phi$ can be found either from historical SW debugging data
(e.g., estimated as in Schafer et al. 1979) or known from projects employing
similar SW. The value of $N$ has been shown to be related to SW complexity
metrics referred to in Section 5.3.2.

The SW fault correction rate $c = \lambda p$ must also be fixed for the
tradeoff analysis since it is based on the user's SW support plans. The value
of $\lambda$ is based on the maintenance philosophy and intensity with which
effort is made in troubleshooting and correcting faults. The value of $p$, the
probability of a successful debug, can be estimated from historical SW
debugging data which is dependent on factors such as the skill level of the
user's SW maintenance personnel and SW complexity metrics.

6.4  TRADEOFF  PROCEDURE 

The  first  step  in  performing  a  HW/SW  tradeoff  analysis  is  to  define  the 
system  structure  in  terms  of  HW/SW  reliability  configuration,  maintenance 
policies,  etc.  The  mission  tasks  for  which  HW  and  SW  are  to  be  traded 
should  be  identified  at  this  stage. 

The next step is to estimate the static parameters needed to complete the
model, such as $\phi$, $p$, HW/SW repair rates, and HW failure rates. The range
of interest (and/or feasibility) for the initial number of SW faults, $N$, and
$\lambda_H$, the failure rate associated with the HW portion of the tradeoff,
should also be defined.

Finally, using the baseline model, $A^*$ (or $A_T$) can be computed for values
of ($\lambda_H$, $\phi N$) in the specified range. Alternately, $A^*$ (or
$A_T$) can be fixed at a required level and values ($\lambda_H$, $\phi N$) can
be found (by exercising the model) which yield the specified $A^*$ (or $A_T$),
and an isometric curve can be generated. The latter approach is useful both as
a design tool and as a tool to be used in selecting HW/SW complexity
requirements (see Figure 6.4-1).
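The second approach, fixing the availability requirement and searching for the HW failure rate that meets it at each initial fault count, can be sketched as a bisection search. In the Python sketch below the function `toy_availability` is a deliberately simple stand-in introduced only so that the example runs: it multiplies two steady-state two-state availabilities and is not the baseline model or its minimum availability. All names and parameter values are assumptions of the sketch.

```python
def toy_availability(lam_h, n_faults, phi=0.01, mu_h=2.0, mu_s=2.0):
    """Hypothetical stand-in for the model output (NOT the baseline model)."""
    a_hw = mu_h / (lam_h + mu_h)
    a_sw = mu_s / (phi * n_faults + mu_s)
    return a_hw * a_sw


def solve_lam_h(target, n_faults, model=toy_availability, hi=10.0, tol=1e-10):
    """Bisect for the HW failure rate at which model(...) equals target.

    Assumes the model output decreases as lam_h increases. Returns None when
    the target cannot be met even with lam_h = 0.
    """
    lo = 0.0
    if model(lo, n_faults) < target:
        return None
    if model(hi, n_faults) >= target:
        raise ValueError("upper bracket `hi` is too small")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if model(mid, n_faults) >= target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# One (N, lam_h) point per initial fault count; None marks infeasible points.
isometric_curve = [(n, solve_lam_h(0.97, n)) for n in range(1, 11)]
```

Replacing `toy_availability` with a routine that exercises the actual model and returns $A^*$ or $A_T$ would give the points of an isometric curve; the bisection logic is unchanged as long as availability decreases with the HW failure rate.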


Figure  6.4-1.  Isometric  Curve  of  Constant  A*  as  a  Design  Guide 


6.5  EXAMPLES 

6.5.1 Series Constructs

In the example of Section 4.1-2, there were three states: full-up ($HS$); HW
up, SW down ($H\bar{S}$); HW down, SW up ($\bar{H}S$). For this example,
$\phi = 0.01$, $c = \lambda p = 0.1$, and $\mu_S = \mu_H = 2$. The values of
$\phi N$ and $\lambda_H$ were varied to achieve $A^*$ (approximated using the
program in Appendix F) equal to 0.97. Figure 6.5-1 shows the resulting
isometric curve. In generating Figure 6.5-1, the computer program was run for
$N = 1, 2, \ldots, 10$ SW bugs, initially, and varying $\lambda_H$ until the
minimum value of $A(t)$, $t = 1, 2, \ldots$ (= probability of being in state
"0" since this is the only operational state) reached 0.97 to 3 decimal
places. A similar curve could be generated for $A_T$ by running the computer
program as before but using

$A_T \approx \frac{1}{T+1} \sum_{t=0}^{T} A(t)$

as an approximation to (6.1-2).


Figure 6.5-1. Isometric Curve of A* = 0.97 for the Series Example


Refer to Section 4.2-2 and the five-state case. In this example, we took
$M = 9$, $N_H = 10$, $c = \lambda p = 0.70$, $\mu_S = 1$, $\mu_H = 2$, and
$\phi = 0.001$, and varied $\lambda_H$ and $N$ to achieve $A^* = 0.99$.
Figure 6.5-2 shows the resulting isometric curve generated with the help of
the computer program of Appendix F.


Figure 6.5-2. Isometric Curve of A* = 0.99 for the Redundant HW Example


Section 7.0


CONCLUSIONS  AND  RECOMMENDATIONS 


A theory for combining HW and SW reliability models has been developed
and used to derive a general combined HW/SW reliability model. This model
provides an accurate description of the reliability/maintainability
characteristics of systems possessing both HW and SW.

The general HW/SW model was applied to simple reliability constructs
using a Markov-type SW process and extended to more complex reliability
constructs, including specific application to an example C3 system. Under the
assumptions of this SW process, the HW/SW model is compatible with the
maintenance philosophies employed for C3 systems. In particular, recovery
of the SW operating system without correcting a fault, imperfect debugging,
and numerous HW/SW modes of interaction are all features of the model which
make it flexible enough to handle most systems.

Analyses of availability concepts using the HW/SW model indicate the
inadequacy of the commonly used "steady-state" availability as a valid figure
of merit, a fact which has been observed in C3 systems. The availability of
a typical system, being time dependent, will exhibit the transient effect
of imperfect SW correction in the form of an undershoot below the steady-state
value. The magnitude and duration of this transient are determined by
the SW maintenance policy, the initial number of faults in the SW, the SW
failure rate for a single fault, and the relative magnitude of SW failure rate
and HW failure rate. These phenomena are reflected by parameters in the
model. The SW maintenance policy is reflected in the rate at which maintenance
teams correct faults (this rate is $c = \lambda p$, where $\lambda$ is the rate
at which attempts are made and $p$ is the probability of a successful
correction). The initial number of SW faults is $N$ and the SW failure rate
associated with each fault is $\phi$. When HW dominates the system in terms of
failure rate, the transient undershoot of availability below steady state can
be eliminated altogether. When HW and SW failure rates are of approximately
the same order of magnitude, the values of $c$ (rate of fault correction) and
$\phi$ determine the transient. With $\phi$ held fixed, the length of time
availability stays below steady state increases as $c$ decreases and decreases
as $c$ increases, while with $c$ fixed, the maximum value of the undershoot
increases as $\phi$ increases and decreases as $\phi$ decreases (see
Figure 4.1-3).

The foregoing has an obvious impact on specifying availability requirements
for systems which contain embedded SW. The design and analysis of systems
possessing both HW and SW should be carried out from the standpoint of minimum
availability instead of the sometimes over-optimistic steady-state
availability.

Using the experimental computer program representation of the HW/SW model,
two approximations were considered: 1) "lumping" the SW failure and repair
rates (e.g., using the "initial" SW failure rate) with the corresponding HW
failure and repair rates, and 2) calculating the separate availability of HW
and SW and then combining these to obtain a total HW/SW availability

$A_T(t) = A_{HW}(t) \cdot A_{SW}(t)$

Reasonable approximations could be obtained in general, however, only for the
series case. The accuracy of lumping the HW and SW failure and repair rates
is dependent on the amount of SW present in the system, becoming worse as
steady state is reached in a SW-dominant system. Calculating separate HW and
SW availabilities provides a more accurate approximation than "lumping",
although the difficulty in computing the SW availability term would probably
also necessitate the use of a computer. For a redundant HW configuration, the
approximation errors in these two cases are dependent on the number of HW
units and the number of bugs initially present in the system, and can result
in a very poor approximation in the transient period.

In order to make the model proposed in this study more practical, it will be
necessary to develop a comprehensive computer program for its implementation
with a step-by-step user's guide on detailed applications. The computer
program documented in this study is but an experimental version of such a
program and serves only as a model for a more comprehensive computer program.
An effort to develop, from the experimental program or other means, a program
which can be used interactively to compute the relevant reliability measures
and perform tradeoffs over wide ranges of the parameters is therefore
recommended.

In addition to the development of an interactive computer program, a
detailed study of the statistical aspects of estimating the model parameters,
including confidence estimation, is needed. Such a study would involve the
unification of techniques (which are, at this time, scattered about the SW/HW
reliability literature) for estimating the model parameters and the study of
the statistical properties of the estimators.

Once these statistical properties have been established, it is also
recommended that a combined HW/SW test methodology be developed similar to
that developed for MIL-STD-781, "Reliability Tests: Exponential Distribution".


APPENDIX A


IMPORTANT NOTATIONS AND DEFINITIONS


$c$ (= $\lambda p$)      SW fault correction rate.

$F_K$                    Combined SW maintainability measure.

$P_j(\cdot)$             Solution to the Kolmogorov differential equations
                         corresponding to the system state transition diagram
                         when there are $j$ bugs in the SW.

$p$                      Probability of perfect debug.

$\lambda$                Rate at which SW maintenance personnel attempt to
                         find and fix SW faults.

$X(t)$                   Number of SW faults at time $t$.

$Y(t)$                   System state at time $t$.

0, 1, ...                Totality of system states; 0 = full-up.

OP                       The set of system operational states.

$t_0, t_1, \ldots$       Jump times for $X(t)$.

absorbing state          An absorbing state is a state such that once a
                         process reaches it, the process remains there forever.

$N_H$                    Total number of HW units.

$M$                      Number of HW units required.

$p(E_K)$                 Probability of module execution.

$\phi$                   Constant of proportionality related to SW failure
                         rate.

$\lambda_H$              HW failure rate.

$\mu_H$                  HW repair rate.

$\mu_S$                  SW "repair" rate.

WS                       HW standby unit duty cycle.

$\delta_M$               SW maintainability figure of merit.

APPENDIX  B 


PURELY  DISCONTINUOUS  MARKOV  PROCESSES 

The material in this section is adapted from Kannan (1979), Feller (1957,
1966), and Hoel, Port and Stone (1972). Throughout this report, all random
processes will be defined on the same probability space (E, F, P), where E is
the set of elementary events, F is a sigma algebra of subsets of E, and P is a
probability measure defined on F.

By stochastic process (sp) is meant a collection of random variables $X(t)$,
$t \ge 0$, defined on (E, F, P). The range of $X(t)$ will be assumed to be a
finite collection of numbers which can be taken, without loss of generality,
to be $\{0, 1, 2, \ldots, J\}$. When $X(t)$ is taken to be the state of the SW,
for example, $X(t) = k$, $0 \le k \le J$, represents the number of faults
remaining in the SW. Similarly, when $X(t)$ is taken to be the state of a
system, $X(t) = k$, $1 \le k \le J$, could be degraded and failed states while
$X(t) = 0$ represents the full-up state. In these examples, the index variable
$t$ is assumed to be the time.

The sp $\{X(t), t \ge 0\}$ is called a Markov process if, for
$0 \le t_0 < t_1 < \cdots < t_n < s < t$ and integers
$x_0, x_1, \ldots, x_n, x, y$ in $\{0, 1, \ldots, J\}$:

$$P\{X(t) = y \mid X(t_k) = x_k,\ 0 \le k \le n,\ X(s) = x\} = P\{X(t) = y \mid X(s) = x\}. \tag{B.1-1}$$

If, in addition to (B.1-1), for all $t > s \ge 0$

$$P\{X(t) = y \mid X(s) = x\} = P\{X(t-s) = y \mid X(0) = x\}, \tag{B.1-2}$$

the process is called a time-homogeneous Markov process. The property
(B.1-1) can be interpreted by saying that the process' future is independent
of its past evolution, given the present value of the process. When both
(B.1-1) and (B.1-2) are satisfied, the function
$p(h,x,y) = P\{X(h) = y \mid X(0) = x\}$ is called the transition function of
the process. Both (B.1-1) and (B.1-2) will be assumed hereafter.

The function $q(x)$, $x \in \{0, 1, \ldots, J\}$, is called the intensity
function of the Markov process if $q(x)\Delta t + o(\Delta t)$ is the
probability that $X(t)$ will undergo a random change in the time interval
$(t, t+\Delta t)$ when $X(t) = x$. The conditional probability $Q(x,y)$ that
$X(t+\Delta t) = y$, given that $X(t) = x$ and a change takes place in
$(t, t+\Delta t)$, is called the relative transition function of the Markov
process. It is clear that for all $x, y \in \{0, 1, \ldots, J\}$:

$$q(x) \ge 0, \qquad Q(x,x) = 0, \qquad \sum_{k=0}^{J} Q(x,k) = 1. \tag{B.1-3}$$

The stationary Markov process $\{X(t), t \ge 0\}$ is called a purely
discontinuous Markov process if in an arbitrary time interval
$(t, t+\Delta t)$, $X(t)$ undergoes a change with probability
$q(x)\Delta t + o(\Delta t)$, remains unchanged with probability
$1 - q(x)\Delta t + o(\Delta t)$, and undergoes more than one change with
probability $o(\Delta t)$. Thus, if $\{X(t), t \ge 0\}$ is a purely
discontinuous stationary Markov process with transition function p, intensity
function q, and relative transition function Q, it follows that for
$s, t \ge 0$ and $x, y \in \{0, 1, \ldots, J\}$:

$$p(s,x,y) = \begin{cases} 1 - q(x)s + o(s), & \text{if } x = y \\ s\,q(x)\,Q(x,y) + o(s), & \text{if } x \ne y \end{cases} \tag{B.1-4}$$

$$p(s+t,x,y) = \sum_{k=0}^{J} p(t,x,k)\,p(s,k,y). \tag{B.1-5}$$


Substituting (B.1-4) into (B.1-5) yields

$$p(s+t,x,y) = p(t,x,y)\bigl(1 - q(y)s + o(s)\bigr) + \sum_{k \ne y} p(t,x,k)\bigl[s\,q(k)\,Q(k,y) + o(s)\bigr]$$

so that

$$p(s+t,x,y) - p(t,x,y) = -s\,q(y)\,p(t,x,y) + \sum_{k \ne y} p(t,x,k)\,s\,q(k)\,Q(k,y) + o(s). \tag{B.1-6}$$

Dividing (B.1-6) by s and letting s tend to zero through positive values
yields the so-called Kolmogorov forward differential equations

$$\frac{\partial}{\partial t}\,p(t,x,y) = \sum_{k=0}^{J} p(t,x,k)\,A(k,y), \qquad y = 0, 1, \ldots, J, \tag{B.1-7}$$

where

$$A(x,y) = \begin{cases} -q(x), & \text{if } x = y \\ q(x)\,Q(x,y), & \text{if } x \ne y. \end{cases} \tag{B.1-8}$$

The values $A(x,y)$ are called infinitesimal parameters (or transition rates
when $x \ne y$). Throughout this report A (or $A_j$) will denote a matrix with
elements $A(x,y)$ (or $A_j(x,y)$).
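
The construction of the infinitesimal matrix in (B.1-8) and its use in the forward equations (B.1-7) can be sketched numerically. The following is a minimal illustration only: the function names `build_generator` and `forward_euler_step`, the three-state example, and the explicit Euler integrator are assumptions made for this sketch and are not taken from the report.

```python
def build_generator(q, Q):
    """Assemble A(x, y) of (B.1-8) from the intensity q and relative transitions Q."""
    n = len(q)
    return [[-q[x] if x == y else q[x] * Q[x][y] for y in range(n)]
            for x in range(n)]


def forward_euler_step(p_row, A, dt):
    """Advance p(t, x, .) by dt using the forward equations (B.1-7)."""
    n = len(p_row)
    return [p_row[y] + dt * sum(p_row[k] * A[k][y] for k in range(n))
            for y in range(n)]


# Hypothetical 3-state chain: 2 -> 1 at rate 2, 1 -> 0 at rate 1, state 0 absorbing.
q = [0.0, 1.0, 2.0]
Q = [[0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = build_generator(q, Q)

p = [0.0, 0.0, 1.0]            # start in state 2, i.e. p(0, 2, .)
for _ in range(1000):          # integrate to t = 1 with dt = 0.001
    p = forward_euler_step(p, A, 0.001)
print(p)
```

Each row of A sums to zero, so the Euler steps keep the components of p summing to one. A production solver would use a higher-order method than this first-order sketch.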

A useful fact in studying purely discontinuous stationary Markov processes
is that if such a process starts in state x at some time t, then the amount of
time spent in that state is exponentially distributed with mean $[q(x)]^{-1}$,
where q is the intensity function of the process. With this fact and having
derived the foregoing mathematical properties, it is useful to describe the
evolution of such a process (see Figure B.1-1). The process starts at time 0
in state $x_0$, say. It remains there for a length of time $\tau_1$, which is
exponentially distributed with parameter $q(x_0)$, and jumps to state $x_1$
with probability $Q(x_0, x_1)$. The process remains in state $x_1$ for a length
of time $\tau_2$, which is exponentially distributed with parameter $q(x_1)$,
and then jumps to state $x_2$ with probability $Q(x_1, x_2)$; etc. The paths of
such a process are step functions with probability 1.
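
The evolution just described (exponential holding time, then a jump chosen by Q) translates directly into a path simulator. This is a hedged sketch: `simulate_path`, the seed, and the four-state example with q(x) = 0.5x are illustrative assumptions, not part of the report.

```python
import random


def simulate_path(x0, q, Q, t_end, rng):
    """Return [(jump_time, state), ...] for one sample path on [0, t_end]."""
    t, x = 0.0, x0
    path = [(t, x)]
    while q[x] > 0.0:                      # q(x) = 0 marks an absorbing state
        t += rng.expovariate(q[x])         # holding time ~ Exp(q(x)), mean 1/q(x)
        if t > t_end:
            break
        # choose the next state y with probability Q(x, y)
        x = rng.choices(range(len(q)), weights=Q[x])[0]
        path.append((t, x))
    return path


rng = random.Random(12345)
q = [0.0, 0.5, 1.0, 1.5]                   # hypothetical intensities q(x) = 0.5 * x
Q = [[0, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]
path = simulate_path(3, q, Q, t_end=1000.0, rng=rng)
for jump_time, state in path:
    print(round(jump_time, 3), state)
```

The returned list is exactly the step-function description of a path: the process holds each listed state from its jump time until the next one.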


Figure B.1-1. Typical Path of a Purely Discontinuous Stationary Markov Process


APPENDIX C

A  MARKOVIAN  SW  PROCESS 

In this section a special case of a purely discontinuous stationary Markov
process is worked out and shown to be probabilistically equivalent to the
Goel/Okumoto Imperfect Debugging model (1978). The transition function is
derived in closed form and related to the binomial probability function.

Assume that the number of software faults present in a system at time t,
$X(t)$, is a purely discontinuous stationary Markov process
$\{X(t), t \ge 0\}$ with range $\{0, 1, \ldots, N\}$ and transition diagram
given by:

    N --[Nc]--> N-1 --[(N-1)c]--> N-2 --> ... --> 1 --[c]--> 0

The interpretation of this diagram is as follows. Supposing that the process
starts in state N at time zero (i.e., $X(0) = N$), the next transition is
always to N-1, with transition rate $A(N, N-1) = Nc$. When the next transition
occurs, it always takes the process to state N-2, with transition rate
$A(N-1, N-2) = (N-1)c$. The process always transitions to the next lower state
until it hits state 0, where it remains. For this process, the parameters
described in Appendix B are given by:


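$$q(x) = cx, \qquad x = 0, 1, \ldots, N;\ c > 0,$$

$$Q(x,y) = \begin{cases} 1, & \text{if } y = x-1 \\ 0, & \text{otherwise,} \end{cases}$$

$$A(x,y) = \begin{cases} -cx, & x = y,\ 0 \le x \le N \\ cx, & y = x-1,\ 1 \le x \le N \\ 0, & \text{otherwise.} \end{cases}$$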


Such a process is called a linear pure death process. The Kolmogorov forward
differential equations are:

$$\frac{\partial}{\partial t}\,p(t,x,x) = -cx\,p(t,x,x),$$

$$\frac{\partial}{\partial t}\,p(t,x,y) = -cy\,p(t,x,y) + c(y+1)\,p(t,x,y+1), \qquad 0 \le y \le x-1. \tag{C.1-1}$$


The solution of this system of equations under the initial conditions

$$p(0,x,y) = \begin{cases} 1, & \text{if } x = y = N \\ 0, & \text{otherwise} \end{cases}$$

(i.e., $X(0) = N$ with probability 1) is

$$p(t,N,y) = \binom{N}{y} e^{-cty}\,(1 - e^{-ct})^{N-y}, \qquad 0 \le y \le N. \tag{C.1-2}$$

From the definition of p, (C.1-2) can be rewritten as

$$P\{X(t) = y \mid X(0) = N\} = \binom{N}{y} e^{-cty}\,(1 - e^{-ct})^{N-y}, \qquad 0 \le y \le N. \tag{C.1-3}$$
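
As a numerical sanity check, the binomial form (C.1-3) can be evaluated and substituted back into the forward equations (C.1-1). The sketch below is illustrative only: `death_pmf`, `forward_residual`, the central-difference step, and the parameter values are assumptions chosen for the example.

```python
from math import comb, exp


def death_pmf(t, N, y, c):
    """P{X(t) = y | X(0) = N} from (C.1-3)."""
    r = exp(-c * t)                       # probability that one given fault survives to t
    return comb(N, y) * r**y * (1.0 - r)**(N - y)


def forward_residual(t, N, y, c, h=1e-5):
    """Left side minus right side of (C.1-1), using a central difference in t."""
    lhs = (death_pmf(t + h, N, y, c) - death_pmf(t - h, N, y, c)) / (2.0 * h)
    rhs = -c * y * death_pmf(t, N, y, c)
    if y < N:
        rhs += c * (y + 1) * death_pmf(t, N, y + 1, c)
    return lhs - rhs


N, c, t = 5, 0.3, 2.0                     # illustrative values only
pmf = [death_pmf(t, N, y, c) for y in range(N + 1)]
residuals = [forward_residual(t, N, y, c) for y in range(N + 1)]
print(pmf)
print(max(abs(r) for r in residuals))
```

The probabilities sum to one and the residuals vanish up to finite-difference error, which is consistent with each fault surviving independently with probability $e^{-ct}$.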


In the Imperfect Debugging Model, it is assumed that N software faults
are present initially, all independent of one another, each having constant
occurrence rate $\lambda > 0$. The probability of two or more faults occurring
simultaneously is assumed negligible and no new faults are introduced.
At most one fault is corrected at correction time, and the time to correct a
fault is neglected. The fault causing a failure, when detected, is corrected
with probability p, $0 < p < 1$, and not removed with probability $q = 1-p$.
Let $X^*(t)$ denote the number of faults remaining in the software at time t.
It will be shown that $X^*(t)$ has the distribution (C.1-3) with X replaced
with $X^*$ and $c = \lambda p$, so that $X^*$ is in fact equivalent to a purely
discontinuous stationary Markov process, a special case of the more general
semi-Markov process described by Goel and Okumoto.


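Equation (3.27) of Goel & Okumoto gives

$$P\{X^*(t) = y \mid X^*(0) = N\} = G_{N,y}(t) - G_{N,y-1}(t), \qquad y = 0, 1, \ldots, N, \tag{C.1-4}$$

where

$$G_{N,y}(t) = \begin{cases} 1, & y = N \\ 0, & y \le -1 \\ \displaystyle\sum_{j=1}^{N-y} \frac{N!\,(-1)^{j-1}\,j}{y!\,j!\,(N-y-j)!\,(y+j)} \left(1 - e^{-(y+j)p\lambda t}\right), & 0 \le y \le N-1. \end{cases} \tag{C.1-5}$$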


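Expression (C.1-4) is simplified as follows. For $0 \le y \le N-1$, (C.1-5) can
be used to give

$$\begin{aligned}
G_{N,y}(t) - G_{N,y-1}(t)
&= \sum_{j=1}^{N-y} \frac{N!\,(-1)^{j-1}\,j\,\bigl(1 - e^{-(y+j)p\lambda t}\bigr)}{y!\,j!\,(N-y-j)!\,(y+j)}
 - \sum_{j=1}^{N-y+1} \frac{N!\,(-1)^{j-1}\,j\,\bigl(1 - e^{-(y-1+j)p\lambda t}\bigr)}{(y-1)!\,j!\,(N-y+1-j)!\,(y-1+j)} \\
&= \sum_{j=1}^{N-y} \frac{N!\,(-1)^{j-1}\,j\,\bigl(1 - e^{-(y+j)p\lambda t}\bigr)}{y!\,j!\,(N-y-j)!\,(y+j)}
 + \sum_{j=0}^{N-y} \frac{N!\,(-1)^{j-1}\,(j+1)\,\bigl(1 - e^{-(y+j)p\lambda t}\bigr)}{(y-1)!\,(j+1)!\,(N-y-j)!\,(y+j)} \\
&= \frac{N!}{(N-y)!\,y!} \sum_{j=0}^{N-y} \frac{(N-y)!\,(-1)^{j-1}\,\bigl(1 - e^{-(y+j)p\lambda t}\bigr)\,(y+j)}{j!\,(N-y-j)!\,(y+j)} \\
&= \binom{N}{y} \sum_{j=0}^{N-y} \binom{N-y}{j} (-1)^{j}\,e^{-(y+j)p\lambda t} \\
&= \binom{N}{y}\,e^{-yp\lambda t}\,\bigl(1 - e^{-p\lambda t}\bigr)^{N-y}.
\end{aligned}$$

(When $y = 0$ the second sum is absent, since $G_{N,-1}(t) = 0$. The fourth
line uses $\sum_{j=0}^{N-y} \binom{N-y}{j}(-1)^{j} = 0$ for $N - y \ge 1$.)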
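
The reduction of $G_{N,y}(t) - G_{N,y-1}(t)$ to a binomial probability can be checked numerically. The sketch below assumes the form of (C.1-5) with summation over j = 1, ..., N-y; the function names and parameter values are illustrative and are not taken from Goel and Okumoto or from the report.

```python
from math import comb, exp, factorial


def G(N, y, t, p, lam):
    """Cumulative form G_{N,y}(t) as written in (C.1-5)."""
    if y >= N:
        return 1.0
    if y <= -1:
        return 0.0
    total = 0.0
    for j in range(1, N - y + 1):
        coeff = (factorial(N) * (-1) ** (j - 1) * j
                 / (factorial(y) * factorial(j) * factorial(N - y - j) * (y + j)))
        total += coeff * (1.0 - exp(-(y + j) * p * lam * t))
    return total


def binomial_form(N, y, t, p, lam):
    """Right-hand side of (C.1-3) with c = p * lam."""
    r = exp(-p * lam * t)
    return comb(N, y) * r**y * (1.0 - r)**(N - y)


N, p, lam, t = 6, 0.9, 0.4, 1.7           # illustrative values only
diffs = [G(N, y, t, p, lam) - G(N, y - 1, t, p, lam) - binomial_form(N, y, t, p, lam)
         for y in range(N + 1)]
print(max(abs(d) for d in diffs))
```

For small N the two expressions agree to rounding error. For large N the alternating sum in (C.1-5) loses accuracy, so the binomial form is the safer one to evaluate.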

The correspondence of (C.1-3) and (C.1-4) is clear. The interpretation
of this correspondence is that the Imperfect Debugging Model is equivalent to
a linear pure death process starting at N with infinitesimal parameters defined
by:

$$A(x,y) = \begin{cases} -p\lambda x, & x = y \\ p\lambda x, & y = x-1 \\ 0, & \text{otherwise.} \end{cases}$$

The effect of imperfect debugging is also clear. The "perfect debugging
rate" for a single fault, namely $\lambda$, is simply reduced by a factor of p,
the probability of successfully correcting the fault.

To complete the theory for the baseline combined HW/SW reliability model,
it is necessary to derive the distributions for the times at which jumps occur
in the process $X(t)$ defined in the beginning of this section. It is clear
from the structure of $X(t)$ that the time at which the kth jump occurs
(assuming $X(0) = N$) is the sum of independent exponentially distributed
random variables $\tau_1, \ldots, \tau_k$, where $E\tau_j = 1/[(N-j+1)c]$.

Goel and Okumoto compute the distribution of the time required to reach a
specified number of bugs y, which is equivalent to the distribution of the time
of the kth jump in $X^*(t)$ where $k = N-y$. The expression they derive is
given by the cumulative distribution defined by $G_{N,y}(t)$ in (C.1-5) with
$0 \le y \le N-1$.

Having derived expression (C.1-3) for the $X(t)$ process, the cumulative
distribution function of the time $t_k$ of the kth jump in $X(t)$ can be
derived by noticing that the event $\{X(s) \le N-k\}$ is equivalent to the
event $\{t_k \le s\}$, so that

$$P\{t_k \le s\} = P\{X(s) \le N-k\} = \sum_{j=0}^{N-k} \binom{N}{j} e^{-csj}\,(1 - e^{-cs})^{N-j} \tag{C.1-6}$$

for $s \ge 0$ and $k = 1, 2, \ldots, N$, the distribution of $t_0$ being
degenerate since $t_0 = 0$ will be a convention. If $t^y$ is defined as the
time at which $X(t)$ first reaches y, $y = 0, 1, \ldots, N$, then obviously
$t^y = t_{N-y}$ and (C.1-6) gives

$$P\{t^y \le t\} = \sum_{j=0}^{y} \binom{N}{j} e^{-ctj}\,(1 - e^{-ct})^{N-j} \tag{C.1-7}$$

for $y = 0, 1, \ldots, N-1$ and $t \ge 0$. If $c = p\lambda$ and X is replaced
by $X^*$ of the Imperfect Debugging Model, then $t^y$ is the time at which
there are y remaining errors and (C.1-7) is equivalent to $G_{N,y}(t)$ as given
in Goel & Okumoto and reproduced in (C.1-5).
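
The jump-time distribution (C.1-6) can be cross-checked against the fact that $t_k$ is a sum of independent exponential holding times with means $1/[(N-j+1)c]$. The sketch below is an illustration under assumed parameter values; the helper names and the trapezoidal integration of the tail probability are choices made for this example.

```python
from math import comb, exp


def jump_time_cdf(k, s, N, c):
    """P{t_k <= s} from (C.1-6): the k-th jump has occurred by time s."""
    if k == 0:
        return 1.0                        # t_0 = 0 by convention
    r = exp(-c * s)
    return sum(comb(N, j) * r**j * (1.0 - r)**(N - j) for j in range(N - k + 1))


def mean_jump_time(k, N, c):
    """E[t_k] as the sum of the holding-time means 1/[(N-j+1)c], j = 1..k."""
    return sum(1.0 / ((N - j + 1) * c) for j in range(1, k + 1))


def mean_from_cdf(k, N, c, upper=100.0, steps=20000):
    """E[t_k] as the integral of 1 - CDF over [0, upper], by the trapezoidal rule."""
    h = upper / steps
    vals = [1.0 - jump_time_cdf(k, i * h, N, c) for i in range(steps + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))


N, c = 4, 0.5                             # illustrative values only
means = [(mean_jump_time(k, N, c), mean_from_cdf(k, N, c)) for k in range(1, N + 1)]
print(means)
```

With k = N the distribution reduces to $(1 - e^{-cs})^N$, the probability that all N faults have been removed by time s.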
APPENDIX D


DEVELOPMENT  OF  A  COMBINED  HW/SW  RELIABILITY  MODEL 
USING  A  MARKOVIAN  SW  PROCESS 

In this section some criteria will be presented for selecting a model for
the SW process $\{X(t), t \ge 0\}$ and these criteria will be used to show how
the calculation of the combined HW/SW reliability model (3.2-2) can be
accomplished.

A particular model for the SW process, namely the Goel/Okumoto Imperfect
Debugging Model described in Appendix C, will be adopted for the combined
HW/SW reliability model in order to analyze the expressions for the state
occupancy probabilities and derive the reliability measures defined in
Section 3.3.

D.l  CRITERIA  FOR  SELECTING  THE  SW  PROCESS 


Without some additional assumptions on the structure of the process $X(t)$
(i.e., the number of faults remaining in the SW at time t) there is little use
for expression (3.2-2). So, as a first criterion, in addition to $X(t)$ being a
purely discontinuous time-stationary Markov process taking values in
$\{0, 1, \ldots, N\}$ and having $X(0) = N$, it will further be assumed that
$X(t)$ always moves one unit to the left at each transition. That is, $X(t)$
will be assumed to be a "pure-death" process in the sense that errors are never
added (births) to the SW. The partition
$0 = t_0 < t_1 < \cdots < t_N < t_{N+1} = t_{N+2} = \cdots = +\infty$ will
denote the random times at which $X(t)$ transitions. Mathematically, this
implies that the probabilities $Q(x,y)$ defined in (B.1-3) take the form

$$Q(x,y) = \begin{cases} 1, & \text{if } y = x-1 \\ 0, & \text{otherwise} \end{cases} \tag{D.1-1}$$

and moreover,

$$P\{X(t_k) = N-k \mid X(0) = N\} = 1 \tag{D.1-2}$$

for $k = 0, 1, \ldots, N$.

Physically  this  means  that  when  the  number  of  faults  in  the  SW  changes, 
it  always  decreases  by  1.  That  is,  bugs  are  not  added  by  maintenance  team 
interventions  and  no  more  than  one  bug  is  removed  at  a  time. 

Because of this newly imposed structure on X, it is more convenient to
compute $P\{Y(t) = n\}$ by conditioning on $\{t_k \le t < t_{k+1}\}$, so that

$$P\{Y(t) = n\} = \sum_{k=0}^{N} P\{Y(t) = n \mid t_k \le t < t_{k+1}\}\,P\{t_k \le t < t_{k+1}\}. \tag{D.1-3}$$

Assuming $X(0) = N$, then because of the structure of X, the events
$\{t_k \le t < t_{k+1}\}$ and $\{X(t) = N-k\}$ are equivalent, so that the last
written probability in (D.1-3) is computed by finding the distribution of
$X(t)$ via the theory of Appendix B. It remains to express
$P\{Y(t) = n \mid t_k \le t < t_{k+1}\}$ in terms of computable quantities.
But, because of the construction of the sample paths of $Y(t)$, it is easy to
write down the aforementioned conditional probability in terms of the $p_j$
defined by (3.2-1), i.e.


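$$\begin{aligned}
P\{Y(t) = n \mid t_k \le t < t_{k+1}\}
= \int\!\cdots\!\int \sum_{i_1} \cdots \sum_{i_k}\;
& p_N(s_1, 0, i_1)\,p_{N-1}(s_2, i_1, i_2) \cdots p_{N-k+1}(s_k, i_{k-1}, i_k) \\
& \cdot\, p_{N-k}\Bigl(t - \sum_{\ell=1}^{k} s_\ell,\ i_k,\ n\Bigr)\,
f_{\tau_1, \ldots, \tau_{k+1} \mid \{t_k \le t < t_{k+1}\}}(s_1, \ldots, s_{k+1})\;
ds_1\,ds_2 \cdots ds_{k+1}
\end{aligned} \tag{D.1-4}$$

for $k = 1, \ldots, N$, where $t_q = \sum_{\ell=1}^{q} \tau_\ell$ and
$\tau_\ell = t_\ell - t_{\ell-1}$, and

$$P\{Y(t) = n \mid 0 \le t < t_1\} = p_N(t, 0, n). \tag{D.1-5}$$

Here, it is assumed that $Y(0) = 0$. The multiple integral in (D.1-4) extends
over all $s_1 \ge 0, \ldots, s_{k+1} \ge 0$ such that

$$\sum_{\ell=1}^{k} s_\ell \le t < \sum_{\ell=1}^{k+1} s_\ell,$$

f is the joint density function of $\tau_1, \ldots, \tau_{k+1}$ conditioned on
the event $\{t_k \le t < t_{k+1}\}$, and the multiple sum in (D.1-4) extends
over integers $0 \le i_1 \le J, \ldots, 0 \le i_k \le J$.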

Equation (D.1-3) can thus be computed by finding $p_j$ (solving, numerically
possibly, a system of linear differential equations) and then using (D.1-4) and
(D.1-5) to obtain $P\{Y(t) = n\}$. However, an additional assumption will
render (D.1-4) and (D.1-5) more tractable.

Since a well designed system will spend the great majority of time in some
operational state (this is true in view of typical availability requirements)
it is a good approximation to assume that the state of the system at random
times $t_1 < t_2 < \cdots < t_N$ is always the same operational state. In fact,
although it is not necessary to do so, it can be assumed that the state of the
system at times $t_i$, $i = 1, \ldots, N$, is full-up. As an assumption, this
is not unreasonable since repair efforts are generally undertaken as items fail
so that a system is most likely to occupy the full-up state immediately
following a down-state or degraded state. In practical terms, this is
equivalent to implementing SW changes only during full-up periods. In
mathematical terms, this is equivalent to forcing the points
$0 < t_1 < \cdots < t_N$ to be regeneration-points of the process. Thus, it
will now be assumed that $Y(0) = Y(t_1) = \cdots = Y(t_N) = 0$ where 0
represents the full-up state. Under this provision, (D.1-4) can be further
simplified, and the calculations performed as follows:

For $1 \le k \le N$,

$$\begin{aligned}
P\{Y(t) = n \mid t_k \le t < t_{k+1}\}
&= \int_0^t P\{Y(t) = n \mid t_k \le t < t_{k+1},\ t_k = y\}\;dP\{t_k \le y \mid t_k \le t < t_{k+1}\} \\
&= \int_0^t p_{N-k}(t-y, 0, n)\;dP\{t_k \le y \mid t_k \le t < t_{k+1}\}.
\end{aligned} \tag{D.1-6}$$

When $k = 0$, expression (D.1-5) completes the calculation. The probability
distribution $P\{t_k \le y \mid t_k \le t < t_{k+1}\}$ can be computed when the
process $X(t)$ is specified.

In equations (D.1-4) through (D.1-6) it has been assumed that $Y(0) = 0$ with
probability 1. If it is desired to have $Y(0) = i$ where i is an arbitrary
operational state it is necessary only to change $p_N(s_1, 0, i_1)$ to
$p_N(s_1, i, i_1)$ in (D.1-4) and $p_N(t, 0, n)$ to $p_N(t, i, n)$ in (D.1-5).
Also, in (D.1-4) through (D.1-6) it is tacitly assumed that Y can reach full-up
from any other state.

D.2 THE COMBINED HW/SW RELIABILITY MODEL


The process $X(t)$ described in Appendix C (and shown to be equivalent to
the Goel/Okumoto Imperfect Debugging Model) satisfies (D.1-1) and (D.1-2), and

$$P\{X(t) = j \mid X(0) = N\} = \binom{N}{j} e^{-cjt}\,(1 - e^{-ct})^{N-j} \tag{D.2-1}$$

for $j = 0, 1, \ldots, N$. In addition, if $0 = t_0 < t_1 < \cdots < t_N$
denote the jump times of $X(t)$, then the increments $\tau_i = t_i - t_{i-1}$,
$i = 1, \ldots, N$, are independent, exponentially distributed random variables
with densities given by

$$f_{\tau_i}(s) = \begin{cases} [c(N-i+1)]\,\exp\{-s[c(N-i+1)]\}, & s \ge 0 \\ 0, & s < 0 \end{cases} \tag{D.2-2}$$

for $i = 1, \ldots, N$. In (D.2-1) and (D.2-2), $c = p\lambda$ where
$p \in (0,1)$ is the probability of a "perfect" debug and $\lambda > 0$ is the
rate of maintenance troubleshooting. From now on, this $X(t)$ process will be
assumed.

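The conditional distribution $P\{t_k \le y \mid t_k \le t < t_{k+1}\}$ can now
be computed. For $0 \le y < t$,

$$\begin{aligned}
P\{t_k \le y,\ t_k \le t < t_{k+1}\}
&= P\{t_k \le y,\ t_k \le t < t_k + \tau_{k+1}\} \\
&= \int_0^y \int_{t-s}^{\infty} c(N-k)\,e^{-u[c(N-k)]}\,du\;dP\{t_k \le s\} \\
&= \int_0^y e^{-c(N-k)(t-s)}\,f_{t_k}(s)\,ds,
\end{aligned} \tag{D.2-3}$$

where $f_{t_k}(s)$ is the derivative of the right-hand side expression of
(C.1-6). Thus,

$$P\{t_k \le y \mid t_k \le t < t_{k+1}\} = \begin{cases}
\dfrac{\int_0^y e^{-c(N-k)(t-s)}\,f_{t_k}(s)\,ds}{P\{t_k \le t < t_{k+1}\}}, & 0 \le y < t \\[2ex]
1, & y \ge t \\
0, & y < 0.
\end{cases} \tag{D.2-4}$$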


Substituting (D.2-4) into (D.1-6) gives

$$P\{Y(t) = n \mid t_k \le t < t_{k+1}\} = \frac{\int_0^t p_{N-k}(t-y, 0, n)\,e^{-c(N-k)(t-y)}\,f_{t_k}(y)\,dy}{P\{t_k \le t < t_{k+1}\}}. \tag{D.2-5}$$

When $k = 0$,

$$P\{Y(t) = n \mid 0 \le t < t_1\} = p_N(t, 0, n), \tag{D.2-6}$$

since $t_0 = 0$. Substituting (D.2-5) and (D.2-6) into (D.1-3) gives

$$P\{Y(t) = n\} = e^{-Nct}\,p_N(t, 0, n) + \sum_{k=1}^{N} \int_0^t p_{N-k}(t-y, 0, n)\,e^{-c(N-k)(t-y)}\,f_{t_k}(y)\,dy \tag{D.2-7}$$

or, changing summation index,

$$P\{Y(t) = n\} = e^{-Nct}\,p_N(t, 0, n) + \sum_{j=0}^{N-1} \int_0^t p_j(t-y, 0, n)\,e^{-cj(t-y)}\,f_{t_{N-j}}(y)\,dy. \tag{D.2-8}$$


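From equation (C.1-7), it follows that for $0 \le j \le N-1$,

$$\begin{aligned}
f_{t_{N-j}}(y)
&= \frac{d}{dy} \sum_{\ell=0}^{j} \binom{N}{\ell} e^{-c\ell y}\,(1 - e^{-cy})^{N-\ell} \\
&= \sum_{\ell=0}^{j} \sum_{m=0}^{N-\ell} \binom{N}{\ell} \binom{N-\ell}{m} (-1)^{m+1}\,c(\ell+m)\,e^{-c(\ell+m)y}, \qquad y \ge 0,
\end{aligned} \tag{D.2-9}$$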
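
The double-sum form of the jump-time density in (D.2-9) can be compared against a numerical derivative of the cumulative distribution it is derived from. This is a sketch under assumed values: `jump_cdf`, `jump_density`, the step h, and the parameters are illustrative choices, not quantities from the report.

```python
from math import comb, exp


def jump_cdf(j, y, N, c):
    """P{t_{N-j} <= y}: at most j faults remain at time y."""
    r = exp(-c * y)
    return sum(comb(N, l) * r**l * (1.0 - r)**(N - l) for l in range(j + 1))


def jump_density(j, y, N, c):
    """Density of t_{N-j} evaluated with the double sum of (D.2-9)."""
    total = 0.0
    for l in range(j + 1):
        for m in range(N - l + 1):
            total += (comb(N, l) * comb(N - l, m) * (-1) ** (m + 1)
                      * c * (l + m) * exp(-c * (l + m) * y))
    return total


N, c, y, h = 5, 0.4, 1.3, 1e-5            # illustrative values only
errors = [abs(jump_density(j, y, N, c)
              - (jump_cdf(j, y + h, N, c) - jump_cdf(j, y - h, N, c)) / (2.0 * h))
          for j in range(N)]
print(max(errors))
```

The alternating signs in the double sum cancel heavily when N is large, so agreement of this kind should only be expected for modest N in double precision.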


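so that (D.2-8) becomes

$$P\{Y(t) = n\} = e^{-Nct}\,p_N(t, 0, n) + \sum_{j=0}^{N-1} \sum_{\ell=0}^{j} \sum_{m=0}^{N-\ell} \binom{N}{\ell} \binom{N-\ell}{m} (-1)^{m+1}\,c(\ell+m) \left[ \int_0^t p_j(t-y, 0, n)\,e^{-cj(t-y)}\,e^{-c(\ell+m)y}\,dy \right] \tag{D.2-10}$$

or, changing variables in the integral,

$$P\{Y(t) = n\} = e^{-Nct}\,p_N(t, 0, n) + \sum_{j=0}^{N-1} \sum_{\ell=0}^{j} \sum_{m=0}^{N-\ell} \binom{N}{\ell} \binom{N-\ell}{m} (-1)^{m+1}\,c(\ell+m) \left[ e^{-c(\ell+m)t} \int_0^t p_j(s, 0, n)\,e^{-c(j-\ell-m)s}\,ds \right]. \tag{D.2-11}$$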


Equation (D.2-11) actually gives the probability of the system being in state n
at time t conditioned on starting in state 0 (full-up) at time 0, i.e.
$P\{Y(t) = n \mid Y(0) = 0\}$. Often, however, it is necessary to consider
$P\{Y(t) = n \mid Y(0) = i\}$ for some arbitrary $i \in \{0, 1, \ldots, J\}$.
But the computations are nearly the same with the result being

$$P\{Y(t) = n \mid Y(0) = i\} = e^{-Nct}\,p_N(t, i, n) + \sum_{j=0}^{N-1} \sum_{\ell=0}^{j} \sum_{m=0}^{N-\ell} \binom{N}{\ell} \binom{N-\ell}{m} (-1)^{m+1}\,c(\ell+m) \left[ e^{-c(\ell+m)t} \int_0^t p_j(s, 0, n)\,e^{-c(j-\ell-m)s}\,ds \right] \tag{D.2-12}$$

where $n, i \in \{0, 1, \ldots, J\}$. In deriving (D.2-12) it should be pointed
out that $Y(t_1) = \cdots = Y(t_N) = 0$ is still assumed.
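
To make (D.2-12) concrete, the sketch below evaluates it for a hypothetical two-state system (0 = full-up, 1 = failed) whose transition functions $p_j$ are known in closed form. Everything specific here is an assumption for illustration: the rates `LAM_HW`, `PHI`, `MU`, the form "failure rate = LAM_HW + j * PHI", the Simpson quadrature, and the function names. The report's own program avoids the quadrature by the method of Appendix E.

```python
from math import comb, exp

# Hypothetical 2-state system (0 = full-up, 1 = failed); rates are illustrative only.
LAM_HW = 0.01   # HW failure rate
PHI = 0.05      # SW failure rate contributed by each remaining fault
MU = 1.0        # repair rate


def p_j(j, s, start, end):
    """Transition function p_j(s, start, end) of the 2-state chain with j faults."""
    a = LAM_HW + j * PHI                  # total failure rate while j faults remain
    r = a + MU
    decay = exp(-r * s)
    if start == 0:
        to_up = MU / r + (a / r) * decay
    else:
        to_up = (MU / r) * (1.0 - decay)
    return to_up if end == 0 else 1.0 - to_up


def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule on [lo, hi] with an even number n of panels."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(lo + k * h)
    return total * h / 3.0


def occupancy(t, n, i, N, c):
    """P{Y(t) = n | Y(0) = i} evaluated term by term from (D.2-12)."""
    result = exp(-N * c * t) * p_j(N, t, i, n)
    for j in range(N):                    # j = 0, ..., N-1
        for l in range(j + 1):            # l = 0, ..., j
            for m in range(N - l + 1):    # m = 0, ..., N-l
                if l + m == 0:
                    continue              # the factor c(l+m) is zero
                coeff = comb(N, l) * comb(N - l, m) * (-1) ** (m + 1) * c * (l + m)
                integral = simpson(
                    lambda s: p_j(j, s, 0, n) * exp(-c * (j - l - m) * s), 0.0, t)
                result += coeff * exp(-c * (l + m) * t) * integral
    return result


N, c, t = 3, 0.45, 2.0                    # c = p * lambda with p = 0.9, lambda = 0.5
up = occupancy(t, 0, 0, N, c)
down = occupancy(t, 1, 0, N, c)
print(up, down, up + down)
```

A useful consistency check is that the state occupancy probabilities sum to one over n, which holds here because the weights multiplying the $p_j$ terms are exactly the probabilities $P\{X(t) = j\}$.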

Equation (D.2-12) specifies the entire probability structure of the combined
HW/SW reliability model (when adapted to the Goel/Okumoto SW model) necessary
to derive the availability measure in Section 3.3. A computer program has been
developed to compute (D.2-12) when $i = 0$. This program is documented in
Appendix F. The availability, assuming $Y(0) = 0$, is given by (3.3-1) with
$P\{Y(t) = m\}$ computed from (D.2-12) with $i = 0$.

In order to derive the measures (3.3-2), (3.3-3), and (3.3-4) it will be
necessary to derive $P\{Y(t) = n \mid Y(0) = i\}$ under the provision that all
failed states be made absorbing states. The expression (D.2-12) cannot be used
directly since its derivation depends on the tacit assumption that the full-up
state can be reached from any other state.


For the moment, assume that the transition diagrams and infinitesimal
matrices $A_j$ have been modified so that all failed states are combined into
one absorbing state $i_F$, and suppose that $p_j(u, \ell, m)$, the solutions to
(B.1-7), have been derived under this provision. Suppose also that n and i are
operational states. Given that $Y(0) = i$ and that $t_k \le t < t_{k+1}$,
$Y(t) = n$ can only happen if Y is not absorbed in any of the intervals
$(0, t_1], (t_1, t_2], \ldots, (t_{k-1}, t_k]$ and Y successfully transitions
to n at time t starting from time $t_k$. Using the notation of Section D.1, it
follows that for $k \ge 1$ and $Y(0) = i$

$$\begin{aligned}
P\{Y(t) = n \mid t_k \le t < t_{k+1}\}
= \int\!\cdots\!\int\;
& [1 - p_N(s_1, i, i_F)]\,[1 - p_{N-1}(s_2, 0, i_F)] \cdots [1 - p_{N-k+1}(s_k, 0, i_F)] \\
& \cdot\, p_{N-k}\Bigl(t - \sum_{\ell=1}^{k} s_\ell,\ 0,\ n\Bigr)\,
f_{\tau_1, \ldots, \tau_{k+1} \mid \{t_k \le t < t_{k+1}\}}(s_1, \ldots, s_{k+1})\;
ds_1\,ds_2 \cdots ds_{k+1}
\end{aligned} \tag{D.2-13}$$

where the integration is performed over the region

$$s_1 \ge 0, \ldots, s_{k+1} \ge 0, \qquad \sum_{\ell=1}^{k} s_\ell \le t < \sum_{\ell=1}^{k+1} s_\ell.$$

When $k = 0$, (D.2-13) reduces to

$$P\{Y(t) = n \mid 0 \le t < t_1\} = p_N(t, i, n). \tag{D.2-13a}$$

Combining these results with (D.2-2) gives

$$\begin{aligned}
P\{Y(t) = n \mid Y(0) = i\} = p_N(t, i, n)\,e^{-Nct}
+ \sum_{k=1}^{N} \int\!\cdots\!\int\;
& [1 - p_N(s_1, i, i_F)] \prod_{j=2}^{k} [1 - p_{N-j+1}(s_j, 0, i_F)]\;
p_{N-k}\Bigl(t - \sum_{\ell=1}^{k} s_\ell,\ 0,\ n\Bigr) \\
& \cdot \prod_{j=1}^{k+1} [c(N-j+1)]\,\exp\{-s_j\,c(N-j+1)\}\;ds_j
\end{aligned} \tag{D.2-14}$$

where the integration is performed over

$$s_1 \ge 0, \ldots, s_{k+1} \ge 0, \qquad \sum_{\ell=1}^{k} s_\ell \le t < \sum_{\ell=1}^{k+1} s_\ell.$$

It should be emphasized that (D.2-14) differs from (D.2-12) because (D.2-14)
is derived under the condition that all failed states are collapsed into one
absorbing state $i_F$, whereas the assumption in (D.2-12) is that no states are
absorbing.

Using (D.2-14) in (3.3-2) and (3.3-3) will yield the probability of
failure-free operation starting at i and of duration t, and the
$\mathrm{MTTF}_i$, respectively.

The non-stationary reliability coefficient defined in (3.3-5) is very difficult
to compute, but the steady state reliability coefficient $R(\infty, t_0)$
defined in (3.3-6) can be computed if it can be shown that for the baseline
model,

$$\lim_{t \to \infty} P\{Y(t) = n \mid Y(0) = i\} \equiv \pi(n)$$

exists ($n = 0, 1, \ldots, J$) and is independent of i. To see that $\pi(n)$ is
well defined for the baseline model consider expression (D.1-3) under the
provision that $Y(0) = i$. Because of the baseline model assumptions,

$$P\{t_k \le t < t_{k+1}\} = \binom{N}{N-k} e^{-c(N-k)t}\,(1 - e^{-ct})^{k}$$

for $k = 0, 1, \ldots, N$ (this follows from the comment after expression
(D.1-3) and expression (D.2-1)). Unless $k = N$, this latter expression tends
to 0 as $t \to \infty$, so that the only term in (D.1-3) which can contribute
to the limit is the term with $k = N$, i.e. if the limit exists, then

$$\lim_{t \to \infty} P\{Y(t) = n \mid Y(0) = i\} = \lim_{t \to \infty} P\{Y(t) = n \mid t_N \le t\}\,P\{t_N \le t\}.$$

Using (D.2-5), it follows that

$$P\{Y(t) = n \mid t_N \le t\}\,P\{t_N \le t\} = \int_0^t p_0(t-y, 0, n)\,f_{t_N}(y)\,dy.$$

Since the state-space for Y is finite, then if the full-up state can be reached
from any other state in a finite number of transitions when $X(t) = 0$ (i.e.,
the transition diagram and $A_j$ discussed in Section 3.2 are such that when
$j = 0$, the full-up state can be reached from any other state which is not
absorbing in a finite number of transitions) then

$$\lim_{t \to \infty} p_0(t, 0, n) = v(n)$$

where $v(n)$ satisfies the algebraic equations

$$\sum_{k=0}^{J} v(k)\,A_0(k, n) = 0, \quad n = 0, \ldots, J; \qquad \sum_{k=0}^{J} v(k) = 1 \tag{D.2-15}$$

(Kannan 1979, p. 136). This condition is usually met in practice (i.e., that
full-up is reachable from any other nonabsorbing state), and will be assumed
to be true in what follows.
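
The algebraic system (D.2-15) is a small linear solve. The sketch below shows one way to set it up, replacing one redundant balance equation by the normalisation condition. The three-state matrix `A0`, the function names, and the plain Gaussian elimination are assumptions for illustration and are not the report's LINEQ routine.

```python
def solve_linear(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def steady_state(A0):
    """Solve sum_k v(k) A0[k][n] = 0 with sum_k v(k) = 1, as in (D.2-15)."""
    n = len(A0)
    # Row `row` of M is the balance equation for state `row` (a transpose of A0);
    # the last balance equation is redundant and is replaced by normalisation.
    M = [[A0[k][row] for k in range(n)] for row in range(n)]
    M[-1] = [1.0] * n
    rhs = [0.0] * (n - 1) + [1.0]
    return solve_linear(M, rhs)


# Hypothetical 3-state system with no SW faults: 0 = full-up, 1 = degraded, 2 = failed.
A0 = [[-0.02, 0.02, 0.00],
      [1.00, -1.05, 0.05],
      [2.00, 0.00, -2.00]]
v = steady_state(A0)
print(v)
```

Because the rows of $A_0$ sum to zero, any one balance equation is implied by the others, which is why dropping the last one loses no information.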

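Write

$$\int_0^t p_0(t-y, 0, n)\,f_{t_N}(y)\,dy \qquad \text{as} \qquad \int_0^\infty g_n(t-y)\,f_{t_N}(y)\,dy,$$

where

$$g_n(t-y) = \begin{cases} 0, & \text{if } y > t \\ p_0(t-y, 0, n), & \text{if } 0 \le y \le t. \end{cases}$$

Since

$$\bigl| g_n(t-y)\,f_{t_N}(y) \bigr| \le f_{t_N}(y)$$

and $f_{t_N}$ is integrable on $(0, \infty)$, it follows from the Lebesgue
Dominated Convergence Theorem that

$$\lim_{t \to \infty} \int_0^t p_0(t-y, 0, n)\,f_{t_N}(y)\,dy = \int_0^\infty \lim_{t \to \infty} g_n(t-y)\,f_{t_N}(y)\,dy = \int_0^\infty v(n)\,f_{t_N}(y)\,dy = v(n), \qquad n = 0, 1, \ldots, J.$$

It has thus been shown that

$$\lim_{t \to \infty} P\{Y(t) = n \mid Y(0) = i\} = \pi(n) = v(n) \tag{D.2-16}$$

where $v(n)$ is defined by (D.2-15).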

From these considerations it is seen that "steady state" conditions under
the baseline model entail the eventual complete debugging of the SW. This
does not preclude the possibility of a SW failure due to some other cause
instead of a bug, however. This situation can be modeled by including an SW
component in the transition diagram which has constant failure/repair rate
independent of the number of bugs in the SW.

To compute $R(\infty, t_0)$ (i.e. (3.3-8)) it is necessary to compute
$P_m(t_0)$ (see (3.3-4) and (3.3-6)) under "steady state" conditions. The
necessary expression is

$$R(\infty, t_0) = \sum_{m} \sum_{n} \pi(m)\,p_0(t_0, m, n) \tag{D.2-17}$$

where the double summation is taken over all (m, n) such that m and n are
operational states in the transition diagram corresponding to $A_0$ (i.e.
$X(t) = 0$), and where $p_0(t_0, m, n)$ is computed under the provision that
failed states be made absorbing.
APPENDIX E


NUMERICAL/COMPUTATIONAL  ASPECTS  OF  THE  COMBINED 
HW/SW  RELIABILITY  MODEL 

A  technique  for  computing  expression  (D.2-12)  will  now  be  discussed.  To 
implement  this  technique,  it  is  necessary  only  to  numerically  solve  systems  of 
linear  equations  and  linear  differential  equations.  No  integrations,  as  indicated 
in  (D.2-12)  are  necessary. 

In computing (D.2-12) the major difficulty is in computing

$$\int_0^t e^{-c(j-\ell-m)s}\,p_j(s, 0, n)\,ds, \qquad n = 0, \ldots, J.$$

For any admissible $(j, \ell, m)$ these quantities can be computed by solving a
system of linear equations. To see this, define

$$x(t, j, \ell, m) \equiv \begin{bmatrix}
\int_0^t e^{-c(j-\ell-m)s}\,p_j(s, 0, 0)\,ds \\[1ex]
\int_0^t e^{-c(j-\ell-m)s}\,p_j(s, 0, 1)\,ds \\
\vdots \\
\int_0^t e^{-c(j-\ell-m)s}\,p_j(s, 0, J)\,ds
\end{bmatrix} \tag{E.1-1}$$

and

$$P_j(t) = \begin{bmatrix} p_j(t, 0, 0) \\ p_j(t, 0, 1) \\ \vdots \\ p_j(t, 0, J) \end{bmatrix}. \tag{E.1-2}$$

With this notation, the Kolmogorov forward differential equations (see (B.1-7))
can be written as

$$\frac{d}{ds}\,P_j(s) = A_j'\,P_j(s) \tag{E.1-3}$$

where $A_j$ is the infinitesimal matrix (see Appendix B) corresponding to
$X(t) = j$, and $'$ denotes transpose. The initial conditions for (E.1-3) are
$p_j(0, 0, 0) = 1$, $p_j(0, 0, n) = 0$, $n \ne 0$. Denote by $\mathbf{1}$ a
column vector of 1's whose dimension will be clear from context. As usual, I
will denote an identity matrix, the dimension of which will be clear from
context.

Multiplying (E.1-3) by $e^{-c(j-\ell-m)s}$ and integrating from $s = 0$ to
$s = t$ gives

$$\int_0^t e^{-c(j-\ell-m)s}\,dP_j(s) = A_j' \int_0^t e^{-c(j-\ell-m)s}\,P_j(s)\,ds$$

and integration-by-parts on the left side yields

$$P_j(t)\,e^{-c(j-\ell-m)t} - P_j(0) = \bigl[A_j' - c(j-\ell-m)I\bigr] \int_0^t e^{-c(j-\ell-m)s}\,P_j(s)\,ds$$

or, writing

$$b(t, j, \ell, m) \equiv P_j(t)\,e^{-c(j-\ell-m)t} - P_j(0),$$

the linear system of equations (t is assumed fixed)

$$\bigl[A_j' - c(j-\ell-m)I\bigr]\,x(t, j, \ell, m) = b(t, j, \ell, m) \tag{E.1-4}$$

is obtained. This system of equations is singular when $j - \ell - m = 0$ and
overdetermined otherwise. These problems are caused by the fact that the
components of the vector (E.1-2) must sum to 1 identically in t, i.e.,

$$\mathbf{1}'\,P_j(t) \equiv 1.$$

Because of this,

$$\mathbf{1}'\,x(t, j, \ell, m) = \int_0^t e^{-c(j-\ell-m)s}\,\mathbf{1}'\,P_j(s)\,ds = \int_0^t e^{-c(j-\ell-m)s}\,ds \equiv K(t, j, \ell, m)$$

where

$$K(t, j, \ell, m) = \begin{cases} t, & \text{if } j - \ell - m = 0 \\[1ex] \dfrac{1 - e^{-c(j-\ell-m)t}}{c(j-\ell-m)}, & \text{if } j - \ell - m \ne 0. \end{cases} \tag{E.1-5}$$


The problems can be removed by reducing the order of the system (E.1-4) by
directly imposing the relation
$\mathbf{1}'\,x(t, j, \ell, m) = K(t, j, \ell, m)$ in (E.1-4). To do this, the
following notation is needed.

Let $A(j, \ell, m)$ be the J x J matrix obtained from $A_j' - c(j-\ell-m)I$ by
deleting the last row and last column, and let $H(j, \ell, m)$ be the J x 1
vector obtained from the last column of $A_j' - c(j-\ell-m)I$ by deleting the
last element. Similarly, define $y(t, j, \ell, m)$ to be the J x 1 vector
obtained from $x(t, j, \ell, m)$ by deleting the last element, and
$B(t, j, \ell, m)$ will be the J x 1 vector obtained from $b(t, j, \ell, m)$ by
deleting the last element. The new system then becomes

$$\bigl(A(j, \ell, m) - H(j, \ell, m)\,\mathbf{1}'\bigr)\,y(t, j, \ell, m) = B(t, j, \ell, m) - K(t, j, \ell, m)\,H(j, \ell, m). \tag{E.1-6}$$

If $y(t, j, \ell, m)$ solves (E.1-6), then $x(t, j, \ell, m)$ is given by

$$x(t, j, \ell, m) = \begin{bmatrix} y(t, j, \ell, m) \\ K(t, j, \ell, m) - \mathbf{1}'\,y(t, j, \ell, m) \end{bmatrix}. \tag{E.1-7}$$

Thus, having obtained $P_j(t)$ for some fixed t's of interest by solving the
system (E.1-3) of linear differential equations, the vector (E.1-1) is computed
by solving the system (E.1-6) and using (E.1-7). The state occupancy
probabilities, i.e. (D.2-12), are then easily computed.
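
The reduction (E.1-5) through (E.1-7) can be sketched as follows. The routine takes $P_j(t)$ as given, forms the reduced system, and recovers the integral vector without any quadrature. All names are hypothetical, the argument `kappa` stands for the scalar $c(j-\ell-m)$, and the two-state example with closed-form $p_j$ is an assumption used only to check the result.

```python
from math import exp


def solve_linear(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def integral_vector(Aj, Pt, kappa, t):
    """Return the vector x(t) of (E.1-1) from P_j(t) via (E.1-5), (E.1-6), (E.1-7)."""
    n = len(Aj)                                            # n = J + 1 states
    # Full matrix A_j' - kappa * I (note the transpose of Aj).
    M = [[Aj[col][row] - (kappa if row == col else 0.0) for col in range(n)]
         for row in range(n)]
    # b = P_j(t) e^{-kappa t} - P_j(0), with the process started in state 0.
    b = [Pt[r] * exp(-kappa * t) - (1.0 if r == 0 else 0.0) for r in range(n)]
    K = t if kappa == 0.0 else (1.0 - exp(-kappa * t)) / kappa     # (E.1-5)
    H = [M[r][n - 1] for r in range(n - 1)]                # last column, last element dropped
    reduced = [[M[r][k] - H[r] for k in range(n - 1)] for r in range(n - 1)]
    rhs = [b[r] - K * H[r] for r in range(n - 1)]
    y = solve_linear(reduced, rhs)                         # (E.1-6)
    return y + [K - sum(y)]                                # (E.1-7)


a, mu, t = 0.06, 1.0, 2.0                  # hypothetical failure and repair rates
Aj = [[-a, a], [mu, -mu]]
r = a + mu
p00 = mu / r + (a / r) * exp(-r * t)       # closed-form p_j(t, 0, 0) for two states
Pt = [p00, 1.0 - p00]
x_zero = integral_vector(Aj, Pt, 0.0, t)   # case j - l - m = 0
x_neg = integral_vector(Aj, Pt, -0.45, t)  # case c * (j - l - m) = -0.45
print(x_zero, x_neg)
```

For two states the reduced system is a single scalar equation, which makes the mechanics easy to follow. The same routine applies unchanged to larger state spaces once $P_j(t)$ is available from a differential equation solver.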
APPENDIX F


COMPUTER  PROGRAM  FOR  CALCULATING  STATE  OCCUPANCY  PROBABILITIES 


A computer program has been developed for computing (D.2-12) (with
$i = 0$). This program uses the numerical methods described in Appendix E and
was used for computing the examples used in this study. A source listing is
included in this appendix along with a detailed flow diagram.

The program is written in FORTRAN IV and designed to run on the IBM
370 or AMDAHL 470 computer. To use the program, it is necessary to change
lines 70, 80, 90, and 120 in the main to reflect the values of $\lambda$, p,
and J (J = NS in the program and is the number of states), respectively. The
user supplies the value of NTIME which is the last time point (integer) at
which the probabilities are computed. This value is read on unit 5.

In addition, the user must change lines 120 through 300 in subroutine
SYS to reflect the infinitesimal matrix desired. The matrix A in this
subroutine is related to the infinitesimal matrix $A_j$ used previously in the
following fashion (JJJ is the computer name for j in SYS):

    A(x+1, y+1)            =   A_j(y, x)
    (in subroutine SYS)        (infinitesimal matrix)

Since the computer array cannot have a subscript value of 0, the state values
are shifted one unit (e.g. state 0 is state 1 in the computer program).
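
The index convention above (a transpose combined with a shift to 1-based subscripts) can be illustrated with a short sketch. The function `to_sys_array` and the dictionary representation are hypothetical conveniences for this example and are not part of the FORTRAN program.

```python
def to_sys_array(Aj):
    """Map A_j (0-based states) to the SYS convention A(x+1, y+1) = A_j(y, x).

    The result is a dict keyed by 1-based (row, column) pairs.
    """
    n = len(Aj)
    return {(x + 1, y + 1): Aj[y][x] for x in range(n) for y in range(n)}


# Hypothetical 2-state infinitesimal matrix: row = from-state, column = to-state.
Aj = [[-0.06, 0.06],
      [1.00, -1.00]]
sys_A = to_sys_array(Aj)
print(sys_A[(1, 2)], sys_A[(2, 1)])
```

Entry (1, 2) of the SYS array therefore holds the rate from state 1 to state 0 of the model, which matches the transposed form $A_j'$ used in (E.1-3).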

A simplified flow chart for the program is included in Figure F-1. Figure F-2
is the detailed flow diagram.

A detailed error analysis was performed to determine the accuracy of
the results obtained using this program. The program is not intended for
use when N (the initial number of SW bugs) is large. How large N can be
depends on how many time points (NTIME) are calculated. When N and NTIME
become large, underflows and overflows occur in the exponential function
DEXP and in the computation of the combinatorials used in (D.2-12). Barring
these difficulties, the error analysis showed that the program outputs had
relative error less
than  an  upper  bound  on  the  order  of  10' ^  for  the  cases  considered.  These 
cases  included  NS  =  2,  3,  5,  7;  NTIME  =  6,  6,  ....  15,  N  =  5,  10,  and  values 
of repair rates on the order of 1 or 2, and SW/HW failure rates on the order
of 0.001 to 1.0, and c (= λp) = 0.95.

The major sources of error are in DSDIFF (numerical solution of
differential equations) and LINEQ (linear equation solver). The algorithm
used in DSDIFF is that of Bulirsch and Stoer, "Numerical Treatment of
Ordinary Differential Equations by Extrapolation Methods," Numerische
Mathematik, vol. 8 (1966). The relative error for the solution vectors
is controlled to be less than 10^-13 in this program (see line 160 in
subroutine SOLVE). With this error reasonably controlled, the next likely
source is LINEQ.
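
The extrapolation idea can be sketched in Python as follows. This is an illustration, not a transcription of DSDIFF: it uses polynomial (Aitken-Neville) extrapolation in place of the rational extrapolation of Bulirsch and Stoer, it has no automatic step-size control, and the names `midpoint_step`, `extrapolated_step`, and `integrate`, the substep sequence, and the tolerance are all assumptions. The test equation y' = -y is likewise only an example.

```python
import math


def midpoint_step(f, x, y, h, m):
    """Integrate y' = f(x, y) over [x, x + h] with m modified-midpoint substeps."""
    g = h / m
    y_prev = list(y)
    y_curr = [yi + g * di for yi, di in zip(y, f(x, y))]
    for k in range(1, m):
        d = f(x + k * g, y_curr)
        y_next = [yp + 2.0 * g * di for yp, di in zip(y_prev, d)]
        y_prev, y_curr = y_curr, y_next
    d = f(x + h, y_curr)
    # Final smoothing step, averaging the last two iterates.
    return [0.5 * (yc + yp + g * di) for yc, yp, di in zip(y_curr, y_prev, d)]


def extrapolated_step(f, x, y, h, eps=1e-12):
    """One step of size h, extrapolating midpoint results to zero substep size."""
    counts = [2, 4, 6, 8, 12, 16, 24, 32]   # assumed substep sequence
    table = []
    for j, m in enumerate(counts):
        row = [midpoint_step(f, x, y, h, m)]
        for k in range(1, j + 1):
            ratio = (m / counts[j - k]) ** 2
            row.append([c + (c - p) / (ratio - 1.0)
                        for c, p in zip(row[k - 1], table[j - 1][k - 1])])
        table.append(row)
        if j > 0:
            err = max(abs(u - v) for u, v in zip(row[j], row[j - 1]))
            scale = max(1.0, max(abs(v) for v in row[j]))
            if err <= eps * scale:
                return row[j]
    raise RuntimeError("no convergence; a full solver would halve h and retry")


def integrate(f, x0, y0, x1, steps):
    """Advance from x0 to x1 in equal extrapolated steps."""
    h = (x1 - x0) / steps
    x, y = x0, list(y0)
    for _ in range(steps):
        y = extrapolated_step(f, x, y, h)
        x += h
    return y


# Example: y' = -y, y(0) = 1, integrated to x = 1 in ten steps of 0.1.
result = integrate(lambda x, y: [-y[0]], 0.0, [1.0], 1.0, 10)
print(abs(result[0] - math.exp(-1.0)))
```

Ten steps of 0.1 per unit of time mirror the basic step H = 0.1 used by the calling routine in the listing.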


The algorithm used in LINEQ is Gauss-Jordan reduction, and the error in
using this technique (or any other technique) depends on how "close" the
system of equations is to being singular. To quantify this, it is necessary
to introduce matrix norms. The reference for this material is (Burden et al.).

Suppose the system of linear equations under consideration is

$$Ax = b \tag{F.1-1}$$

where A is n×n, x is n×1, and b is n×1.

For an arbitrary matrix B of dimension p×q, define the norm of B as

$$\|B\| = \max_{1 \le i \le p} \sum_{j=1}^{q} |b_{ij}| \tag{F.1-2}$$

where $b_{ij}$ is the (i,j) element of B. What is needed is to find a bound for

$$\|x - \tilde{x}\| \,/\, \|x\|$$

where $x$ is the exact solution of (F.1-1) and $\tilde{x}$ is the solution
obtained from the Gauss-Jordan reduction algorithm.
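
As a small illustration of the norm in (F.1-2), the following Python sketch computes the maximum absolute row sum and the relative error of an approximate solution. The function names and the numerical values are hypothetical and serve only to show the arithmetic.

```python
def matrix_norm(b):
    """Maximum absolute row sum of b, given as a list of rows, as in (F.1-2)."""
    return max(sum(abs(v) for v in row) for row in b)


def relative_error(x_exact, x_approx):
    """||x - x~|| / ||x|| with vectors treated as n-by-1 matrices."""
    diff = [[e - a] for e, a in zip(x_exact, x_approx)]
    return matrix_norm(diff) / matrix_norm([[e] for e in x_exact])


print(matrix_norm([[1.0, -2.0], [3.0, 4.0]]))    # row sums are 3 and 7
print(relative_error([1.0, 2.0], [1.0, 1.5]))    # 0.5 / 2.0
```

For an n×1 vector the norm reduces to the largest absolute component.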

When A is non-singular (as is the case in the program) it can be shown
that

$$\frac{\|x - \tilde{x}\|}{\|x\|} \le \|A\| \cdot \|A^{-1}\| \cdot \frac{\|b - A\tilde{x}\|}{\|b\|} \tag{F.1-3}$$


The quantity $K(A) = \|A\| \cdot \|A^{-1}\|$ is called the "condition number"
of the matrix A. The condition number is always greater than or equal to 1,
and its magnitude measures how well the system of linear equations behaves in
terms of numerical solution. For practical purposes, computation of
$\|A^{-1}\|$ is the same as solving the linear system and is subject to as
much or more computational inaccuracy. To make computation of the condition
number practical, the following technique can be used.


The first step is to compute $\tilde{x}$ using, say, t-digit arithmetic and
Gauss-Jordan reduction. The residual vector $b - A\tilde{x}$ is then computed
in 2t-digit arithmetic. Gauss-Jordan elimination is then applied to the
system $Ay = (b - A\tilde{x})$ to yield the solution y. The approximate value
of K(A) is then

$$K(A) \approx \frac{\|y\|}{\|\tilde{x}\|} \cdot 10^{t} \tag{F.1-4}$$

yielding

$$\frac{\|x - \tilde{x}\|}{\|x\|} \le \frac{\|y\|}{\|\tilde{x}\|} \cdot 10^{t} \cdot \frac{\|b - A\tilde{x}\|}{\|b\|} \tag{F.1-5}$$

These techniques were applied to the computer program listed in this
appendix by adding an extended precision subroutine (i.e. REAL*16) to
perform Gauss-Jordan elimination and hence perform the necessary 2t-digit
calculations. The maximum relative error (over all calls to LINEQ) was on
the order of 10^-10. This, combined with the controlled error in the
subroutine DSDIFF, indicated excellent precision for the cases mentioned
earlier. The user of this program should be warned, however, that gross
computational errors will occur if NTIME and/or N are too large.
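
The procedure leading to (F.1-4) and (F.1-5) can be sketched in Python under some stated assumptions. Ordinary floats stand in for t-digit arithmetic with t taken as 16, exact rational arithmetic from the standard `fractions` module stands in for the 2t-digit (REAL*16) residual computation, and the function names and the 2×2 test system are hypothetical. The reduction uses no pivoting.

```python
from fractions import Fraction


def gauss_jordan(a, b):
    """Solve a x = b by Gauss-Jordan reduction without pivoting, in floats."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        pivot = aug[k][k]
        if abs(pivot) <= 1e-12:
            raise ZeroDivisionError("almost singular matrix")
        aug[k] = [v / pivot for v in aug[k]]
        for i in range(n):
            if i != k and aug[i][k] != 0.0:
                factor = aug[i][k]
                aug[i] = [vi - factor * vk for vi, vk in zip(aug[i], aug[k])]
    return [row[n] for row in aug]


def residual(a, b, x):
    """b - a x, accumulated exactly in rationals, then rounded to float."""
    out = []
    for row, rhs in zip(a, b):
        ax = sum(Fraction(c) * Fraction(xj) for c, xj in zip(row, x))
        out.append(float(Fraction(rhs) - ax))
    return out


def vec_norm(v):
    """Norm (F.1-2) for an n-by-1 vector: largest absolute component."""
    return max(abs(c) for c in v)


def estimate_condition(a, b, t=16):
    """Return the computed solution, the estimate (F.1-4), and the bound (F.1-5)."""
    x = gauss_jordan(a, b)
    r = residual(a, b, x)
    y = gauss_jordan(a, r)
    k_est = vec_norm(y) / vec_norm(x) * 10.0 ** t
    bound = k_est * vec_norm(r) / vec_norm(b)
    return x, k_est, bound


# Hypothetical nearly singular system with exact solution close to (1, 1).
a = [[1.0, 1.0], [1.0, 1.0001]]
b = [2.0, 2.0001]
x, k_est, bound = estimate_condition(a, b)
print(x, k_est, bound)
```

If the computed residual happens to be exactly zero, the estimate degenerates to zero, so the sketch is informative only when rounding error is actually present.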
[Figure F-1: flow chart not reproduced. Recoverable box text: "... AT TIME
POINTS 1,2,...,NTIME FOR ... AND STORE VALUES" (A); "SOLVE LINEAR SYSTEM
(E.1-6) FOR EACH ... AND STORE VALUES" (B); "COMPUTE MULTIPLE SUM IN
(D.2-12) FOR EACH STATE"; "PRINT RESULTS (I.E. STATE OCCUPANCY
PROBABILITIES AT TIMES 0,1,2,...,NTIME)". Notes: (A) USES SUBROUTINES
SOLVE, DERIV, DSDIFF; (B) USES SUBROUTINE LINEQ.]

Figure F-1. Simplified Flow Chart for Computer Program
Figure F-2. Detailed Flow Diagram

Each consecutive three pages, placed bottom to top, make up one page in
the complete flow diagram. The reference number at the upper right corner
of each box/triangle is a sequence number identifying that box/triangle. The
numbers A.B refer to page A, box/triangle number B. For example, 4.01 is
page 4, box 1. Page numbers are placed at the upper right corner on the
first of each three-page set.

[Figure F-2: detailed flow diagram pages not reproduced.]


      IMPLICIT REAL*8(A-H,O-Z)                                          MAIN  10
      REAL*8 S1(20)/1.0D0,19*0.0D0/                                     MAIN  20
      REAL*8 COMB(21,21)/21*1.0D0,420*0.0D0/                            MAIN  30
      REAL*8 PP(20,21,50)                                               MAIN  40
      REAL*8 ARRAYA(20,20)                                              MAIN  50
      REAL*8 X(20),AA(20,21),SS,LAMBDA                                  MAIN  60
      LAMBDA=1.0D0                                                      MAIN  70
      P=.95D0                                                           MAIN  80
      NS=7                                                              MAIN  90
C     NS=THE NUMBER OF SYSTEM STATES                                    MAIN 100
      READ (5,*) NTIME                                                  MAIN 110
      N=10                                                              MAIN 120
C     N=THE NUMBER OF SOFTWARE BUGS INITIALLY                           MAIN 130
C     NTIME IS THE ENDING TIME FOR THE CALCULATIONS.                    MAIN 140
C     THE PROBABILITIES WILL BE PRINTED AT T=0,1,2,...,NTIME.           MAIN 150
C     LIMITATIONS ARE: NTIME<=50,NS<=20,N<=20. TO CHANGE THESE          MAIN 160
C     IT IS NECESSARY TO CHANGE SOME DIMENSIONS.                        MAIN 170
      CALL COMBIN (COMB,N+1)                                            MAIN 180
      C=P*LAMBDA                                                        MAIN 190
      T=0.0D0                                                           MAIN 200
      IO=0                                                              MAIN 210
      NDIM=NS-1                                                         MAIN 220
      NN=NS-1                                                           MAIN 230
      MM=1                                                              MAIN 240
      CALL SOLVE (NS,NTIME,N,PP)                                        MAIN 250
      SUM=0.0D0                                                         MAIN 260
      DO 10 I2=1,NS                                                     MAIN 270
   10 SUM=SUM+S1(I2)                                                    MAIN 280
      WRITE (6,160) T,(S1(IW),IW=1,NS),SUM                              MAIN 290
      DO 150 I=1,NTIME                                                  MAIN 300
      T=DFLOAT(I)                                                       MAIN 310
      TT=T                                                              MAIN 320
      JJ=N-1                                                            MAIN 330
      DO 20 I1=1,NS                                                     MAIN 340
   20 S1(I1)=DEXP(-C*DFLOAT(N)*TT)*PP(I1,N+1,I+1)                       MAIN 350
      DO 130 J=IO,JJ                                                    MAIN 360
      DO 120 L=IO,J                                                     MAIN 370
      IU=N-L                                                            MAIN 380
      DO 110 M=IO,IU                                                    MAIN 390
      DO 30 L1=1,NS                                                     MAIN 400
      DO 30 L2=1,NS                                                     MAIN 410
   30 ARRAYA(L1,L2)=0.0D0                                               MAIN 420
      CALL SYS (ARRAYA,J)                                               MAIN 430
      Z=-C*DFLOAT(J-L-M)                                                MAIN 440
      DO 40 I1=1,NS                                                     MAIN 450
      ARRAYA(I1,I1)=ARRAYA(I1,I1)+Z                                     MAIN 460
      AA(I1,NS)=DEXP(Z*TT)*PP(I1,J+1,I+1)                               MAIN 470
   40 CONTINUE                                                          MAIN 480
      AA(1,NS)=AA(1,NS)-1.0D0                                           MAIN 490
      CONST=TT                                                          MAIN 500
      IF (Z.NE.0.0D0) CONST=(1.D0/Z)*(DEXP(Z*TT)-1.D0)                  MAIN 510
      NS1=NS-1                                                          MAIN 520
      DO 60 I1=1,NS1                                                    MAIN 530
      DO 50 I2=1,NS1                                                    MAIN 540
   50 ARRAYA(I1,I2)=ARRAYA(I1,I2)-ARRAYA(I1,NS)                         MAIN 550
   60 CONTINUE                                                          MAIN 560
      DO 70 I1=1,NS1                                                    MAIN 570
   70 AA(I1,NS)=AA(I1,NS)-CONST*ARRAYA(I1,NS)                           MAIN 580
      DO 80 I1=1,NS1                                                    MAIN 590
      DO 80 I2=1,NS1                                                    MAIN 600
   80 AA(I1,I2)=ARRAYA(I1,I2)                                           MAIN 610
      CALL LINEQ (AA,NN,X)                                              MAIN 620
      SS=0.0D0                                                          MAIN 630
      DO 90 I1=1,NS1                                                    MAIN 640
   90 SS=SS+X(I1)                                                       MAIN 650
      X(NS)=CONST-SS                                                    MAIN 660
C     WRITE(6,*) (X(LL),LL=1,NS),I,J,L,M                                MAIN 670
      DO 100 K=1,NS                                                     MAIN 680
      S1(K)=S1(K)+COMB(N+1,L+1)*COMB(IU+1,M+1)*((-1)**(M+1))*C*(L+M)*DEXMAIN 690
     1P(-C*(L+M)*TT)*X(K)                                               MAIN 700
  100 CONTINUE                                                          MAIN 710
  110 CONTINUE                                                          MAIN 720
  120 CONTINUE                                                          MAIN 730
  130 CONTINUE                                                          MAIN 740
      SUM=0.0D0                                                         MAIN 750
      DO 140 I2=1,NS                                                    MAIN 760
  140 SUM=SUM+S1(I2)                                                    MAIN 770
  150 WRITE (6,160) T,(S1(IW),IW=1,NS),SUM                              MAIN 780
C     DEBUG INIT(S1,I,J,L,M)                                            MAIN 790
      STOP                                                              MAIN 800
C                                                                       MAIN 810
C                                                                       MAIN 820
C                                                                       MAIN 830
  160 FORMAT (1X,F3.0,2X,8E14.6)                                        MAIN 840
      END                                                               MAIN 850

      SUBROUTINE COMBIN (COMB,ID)
C     THIS ROUTINE COMPUTES I COMBINATORIAL J FOR
C     I=0,1,...,ID;J=0,1,...,ID.
      IMPLICIT REAL*8(A-H,O-Z)
      REAL*8 COMB(21,21)
      COMB(2,2)=1.0D0
      DO 20 I=3,ID
      DO 10 J=2,ID
      COMB(I,J)=COMB(I-1,J-1)+COMB(I-1,J)
   10 CONTINUE
   20 CONTINUE
      RETURN
      END
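
The recurrence used by COMBIN is Pascal's rule. A minimal Python sketch of the same table is given below; the function name `combin` and the 0-based layout are choices made for this illustration, so the FORTRAN element COMB(I+1,J+1) corresponds to `c[i][j]`.

```python
def combin(n):
    """Return c with c[i][j] = i choose j for 0 <= j <= i <= n (0 elsewhere)."""
    c = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        c[i][0] = 1
        for j in range(1, i + 1):
            # Pascal's rule: C(i, j) = C(i-1, j-1) + C(i-1, j)
            c[i][j] = c[i - 1][j - 1] + c[i - 1][j]
    return c


table = combin(10)
print(table[10][5])
```

Python integers are exact, whereas the FORTRAN routine stores the same values as REAL*8.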

      SUBROUTINE SYS (A,JJJ)                                            SYS   10
C     THIS ROUTINE SUPPLIES THE MATRIX AJ                               SYS   20
C     (I.E. THE INFINITESIMAL MATRIX. SEE FINAL REPORT) FOR             SYS   30
C     VALUES OF J=0,1,2,...,N.                                          SYS   40
C     WHEN MODIFYING THIS ROUTINE FOR SPECIFIC CASES THIS               SYS   50
C     ROUTINE USES A IN TRANSPOSED FORM. SO, FOR EXAMPLE,               SYS   60
C     THE (I,J) ENTRY IN THE INFINITESIMAL MATRIX AS DESCRIBED          SYS   70
C     IN THE FINAL REPORT IS ENTERED IN THIS ROUTINE AS THE             SYS   80
C     (J,I) ELEMENT OF A.                                               SYS   90
      IMPLICIT REAL*8(A-H,O-Z)                                          SYS  100
      REAL*8 A(20,20)                                                   SYS  110
      A(1,1)=-(.05D0*DFLOAT(JJJ)+.02D0)                                 SYS  120
      A(2,1)=.05D0*DFLOAT(JJJ)                                          SYS  130
      A(3,1)=.02D0                                                      SYS  140
      A(1,2)=1.D0                                                       SYS  150
      A(2,2)=-1.D0                                                      SYS  160
      A(1,3)=1.D0                                                       SYS  170
      A(3,3)=-(1.018D0+.045*DFLOAT(JJJ))                                SYS  180
      A(4,3)=.045D0*DFLOAT(JJJ)                                         SYS  190
      A(5,3)=.018D0                                                     SYS  200
      A(3,4)=1.D0                                                       SYS  210
      A(4,4)=-1.D0                                                      SYS  220
      A(3,5)=1.D0                                                       SYS  230
      A(5,5)=-(1.016D0+.04D0*DFLOAT(JJJ))                               SYS  240
      A(6,5)=.04D0*DFLOAT(JJJ)                                          SYS  250
      A(7,5)=.016D0                                                     SYS  260
      A(5,6)=1.D0                                                       SYS  270
      A(6,6)=-1.D0                                                      SYS  280
      A(5,7)=1.D0                                                       SYS  290
      A(7,7)=-1.D0                                                      SYS  300
C     A(1,1)=-(.004+DFLOAT(JJJ))                                        SYS  310
C     A(1,2)=2.D0                                                       SYS  320
C     A(1,3)=2.D0                                                       SYS  330
C     A(2,1)=DFLOAT(JJJ)                                                SYS  340
C     A(2,2)=-2.D0                                                      SYS  350
C     A(3,1)=.004D0                                                     SYS  360
C     A(3,3)=-2.D0                                                      SYS  370
      RETURN                                                            SYS  380
      END                                                               SYS  390

      SUBROUTINE SOLVE (NS,NTIME,N,PP)                                  SOLV  10
C     THIS ROUTINE SOLVES THE SYSTEM OF DIFFERENTIAL                    SOLV  20
C     EQUATIONS (KOLMOGOROV EQUATIONS) FOR EACH J AND AT                SOLV  30
C     EACH POINT IN TIME. THE MATRIX PP(I,J,IT) UPON                    SOLV  40
C     EXIT WILL CONTAIN THE SOLUTION FOR STATE I, WHEN                  SOLV  50
C     X(T)=J, AT THE POINT T=IT-1, I.E. IN THE NOTATION                 SOLV  60
C     OF THE REPORT, PJ(IT-1,0,I). IT IS ASSUMED THAT                   SOLV  70
C     Y(0)=0 WITH PROBABILITY 1.                                        SOLV  80
      IMPLICIT REAL*8(A-H,O-Z)                                          SOLV  90
      EXTERNAL DERIV                                                    SOLV 100
      REAL*8 PP(20,21,50),Y(20),DY(20),S(20)                            SOLV 110
      LOGICAL NEWH                                                      SOLV 120
      COMMON /DO/ J,NSS                                                 SOLV 130
      H=.1D0                                                            SOLV 140
      II=1.D0/H+.1                                                      SOLV 150
      EPS=1.D-13                                                        SOLV 160
      NSS=NS                                                            SOLV 170
      N1=N+1                                                            SOLV 180
      N2=NTIME+1                                                        SOLV 190
      DO 50 J=1,N1                                                      SOLV 200
      PP(1,J,1)=1.0D0                                                   SOLV 210
      Y(1)=1.0D0                                                        SOLV 220
      S(1)=0.0D0                                                        SOLV 230
      DO 10 I1=2,NS                                                     SOLV 240
      Y(I1)=0.0D0                                                       SOLV 250
      S(I1)=0.0D0                                                       SOLV 260
      PP(I1,J,1)=0.0D0                                                  SOLV 270
   10 CONTINUE                                                          SOLV 280
      T1=0.0D0                                                          SOLV 290
      DO 40 IT=1,NTIME                                                  SOLV 300
      DO 20 I2=1,II                                                     SOLV 310
      H=.1D0                                                            SOLV 320
   20 CALL DSDIFF (DERIV,NSS,H,T1,Y,EPS,S,NEWH)                         SOLV 330
      DO 30 I1=1,NSS                                                    SOLV 340
   30 PP(I1,J,IT+1)=Y(I1)                                               SOLV 350
C     WRITE(6,*) (Y(K),K=1,NSS),T1,NEWH,H                               SOLV 360
   40 CONTINUE                                                          SOLV 370
   50 CONTINUE                                                          SOLV 380
C     DEBUG INIT(Y)                                                     SOLV 390
      RETURN                                                            SOLV 400
      END                                                               SOLV 410

      SUBROUTINE DERIV (X,Y,DY)
C     THIS ROUTINE COMPUTES THE DERIVATIVES REQUIRED BY
C     DSDIFF.
      IMPLICIT REAL*8(A-H,O-Z)
      REAL*8 Y(20),DY(20),A(20,20)/400*0.0D0/
      COMMON /DO/ J,NSS
      JJ=J
      CALL SYS (A,JJ-1)
      DO 10 I1=1,NSS
      DY(I1)=0.0D0
      DO 10 I2=1,NSS
   10 DY(I1)=DY(I1)+A(I1,I2)*Y(I2)
      RETURN
      END

      SUBROUTINE LINEQ (AA,NN,X)
C     THIS ROUTINE SOLVES THE SYSTEM OF LINEAR
C     EQUATIONS A*X=B WHERE A IS THE ARRAY CONSISTING OF
C     THE FIRST NN COLUMNS OF AA AND B IS THE LAST COLUMN
C     OF AA.
      IMPLICIT REAL*8(A-H,O-Z)
      REAL*8 AA(20,21),X(20)
      M=1
      N=NN
      EPS=1.D-4
      NPLUSM=N+M
      DETER=1.D0
      DO 40 K=1,N
      DETER=DETER*AA(K,K)
      IF (DABS(AA(K,K)).GT.EPS) GO TO 10
      WRITE (6,60)
      STOP
   10 CONTINUE
      KP1=K+1
      DO 20 J=KP1,NPLUSM
   20 AA(K,J)=AA(K,J)/AA(K,K)
      AA(K,K)=1.D0
      DO 40 I=1,N
      IF (I.EQ.K.OR.AA(I,K).EQ.0.D0) GO TO 40
      DO 30 J=KP1,NPLUSM
   30 AA(I,J)=AA(I,J)-AA(I,K)*AA(K,J)
      AA(I,K)=0.D0
   40 CONTINUE
      DO 50 J=1,N
   50 X(J)=AA(J,N+1)
      RETURN
C
C
   60 FORMAT (1X,'ALMOST SINGULAR MATRIX ENCOUNTERED IN LINEQ')
      END

      SUBROUTINE DSDIFF (F,N,H,X,Y,EPS,S,NEWH)
C
C  F IS THE NAME OF A SUBROUTINE CALLED BY 'CALL F(X,Z,DZ)' WHICH
C  STORES IN THE VECTOR DZ THE N COMPONENTS OF THE DERIVATIVE
C  DZ/DX ACCORDING TO THE DIFFERENTIAL EQUATION WHICH IS BEING
C  SOLVED. DZ/DX=F(X,Z).
C  X, Z, AND DZ MUST BE DOUBLE PRECISION.
C
C  N IS THE ORDER OF THE SYSTEM OF DIFFERENTIAL EQUATIONS.
C  N MUST BE NO GREATER THAN MAXORD WHICH IS SET BELOW.
C
C  H IS THE BASIC STEP SIZE.
C
C  X AND Y(VECTOR) ARE THE INITIAL VALUES.
C
C  EPS AND S(VECTOR) ARE THE ERROR BOUNDS.
C  DABS(EPS) SHOULD BE NO SMALLER THAN 1.0D-13.
C
C  NEWH IS A FLAG WHICH IS SET EQUAL TO .TRUE. IF THE STEP SIZE
C  USED BY DSDIFF IS DIFFERENT FROM THE STEP SIZE H GIVEN IN THE
C  PARAMETER LIST. NEWH IS SET EQUAL TO .FALSE. OTHERWISE.
C
      IMPLICIT REAL*8(A-H,O-Z)
      REAL*8 Y(N),S(N)
      REAL*8 YL(25)
      REAL*8 DZ(25)
      REAL*8 YA(25),YM(25),DY(25),DT(25,7),YG(8,25),YH(8,25),D(7)
      INTEGER R,SR
      LOGICAL NEWH
      LOGICAL KONV,BO,BH
      LOGICAL EPSERR
      DATA MAXORD/25/
      DATA EPSERR/.FALSE./
C*********************************************************************
C
C  EACH CALL OF DSDIFF PERFORMS ONE INTEGRATION STEP OF THE
C  EQUATION DY/DX=F(X,Y) ACCORDING TO THE METHOD OF R. BULIRSCH AND
C  J. STOER (NUMERISCHE MATHEMATIK, IN PRESS). THE STEP SIZE WILL
C  BE LESS THAN OR EQUAL TO H. THE PROGRAM TAKES THE FIRST OF THE
C  NUMBERS H, H/2, H/4,... AS STEP SIZE FOR WHICH NO MORE THAN 9
C  EXTRAPOLATION STEPS ARE NEEDED TO OBTAIN A SUFFICIENTLY ACCURATE
C  RESULT. IF THE STEP SIZE USED IS DIFFERENT THAN THE STEP SIZE
C  GIVEN IN THE PARAMETER LIST, THEN THE LOGICAL FLAG NEWH WILL BE
C  SET EQUAL TO .TRUE., OTHERWISE IT WILL BE SET EQUAL TO .FALSE..
C  X AND Y ARE THE INITIAL VALUES FOR THE STEP TO BE COMPUTED. AFTER
C  LEAVING THE SUBROUTINE, THE ORIGINAL VALUES OF THE PARAMETERS X
C  AND Y WILL HAVE BEEN REPLACED BY X+H' AND Y(X+H'), RESPECTIVELY,
C  WHERE H' IS THE STEP SIZE ACTUALLY USED. IN ADDITION THE STEP
C  SIZE WILL HAVE BEEN CHANGED AUTOMATICALLY TO AN ESTIMATED OPTIMAL
C  STEP SIZE FOR THE NEXT INTEGRATION STEP. THE ARRAY S AND THE
C  CONSTANT EPS ARE USED TO CONTROL THE ACCURACY OF THE COMPUTED
C  VALUES. THE SUBROUTINE IS LEFT, IF FOR ALL I=1,2,...,N TWO
C  SUCCESSIVE VALUES FOR Y(I) DIFFER AT MOST BY AN AMOUNT EPS*S(I).
C  EPS SHOULD NOT BE SMALLER THAN 1.0D-13. FOR THE FIRST INTEGRA-
C  TION STEP IT IS ADVISABLE TO SET S(I)=0.0. BEFORE RETURN TO THE
C  CALLING PROGRAM, THE ARRAY S WILL HAVE HAD ITS CONTENTS MODIFIED
C  SO THAT S(I)=MAX(S(I),ABS(Y(I,X))), WHERE THE MAXIMUM IS TAKEN OVER
C  THE INTEGRATION INTERVAL (X,X+H').
C
C*********************************************************************
C
      IF (N.GT.0.AND.N.LE.MAXORD) GO TO 10
      WRITE (6,210)
      STOP
   10 E=DABS(EPS)
      IF (E.GE.1.0D-13) GO TO 30
      IF (EPSERR) GO TO 20
      EPSERR=.TRUE.
      WRITE (6,220)
   20 E=1.0D-13
C
   30 CALL F (X,Y,DZ)
      BH=.FALSE.
      NEWH=.FALSE.
      DO 40 I=1,N
   40 YA(I)=Y(I)
   50 A=X+H
      FC=1.5
      BO=.FALSE.
      M=1
      R=2
      SR=3
      JJ=-1
      DO 190 J1=1,10
      J=J1-1
      D(2)=2.25
      IF (BO) D(2)=4.0/D(2)
      D(4)=4.0*D(2)
      D(6)=4.0*D(4)
      KONV=J.GT.2
      IF (J.LE.6) GO TO 60
      L=6
      D(7)=64.0
      FC=.6D0*FC
      GO TO 70
   60 L=J
      D(L+1)=M*M
   70 M=2*M
      G=H/DBLE(FLOAT(M))
      B=2.0*G
      IF (BH.AND.J.LT.8) GO TO 120
      KK=(M-2)/2
      M=M-1
      DO 80 I=1,N
      YL(I)=YA(I)
   80 YM(I)=YA(I)+G*DZ(I)
      IF (M.LE.0) GO TO 140
      DO 110 K=1,M
      CALL F (X+G*DBLE(FLOAT(K)),YM,DY)
      DO 90 I=1,N
      U=YL(I)+B*DY(I)
      YL(I)=YM(I)
      YM(I)=U
      U=DABS(U)
   90 S(I)=DMAX1(U,S(I))
      IF (K.NE.KK.OR.K.EQ.2) GO TO 110
      JJ=JJ+1
      DO 100 I=1,N
      YH(JJ+1,I)=YM(I)
  100 YG(JJ+1,I)=YL(I)
  110 CONTINUE
      GO TO 140
  120 DO 130 I=1,N
      YM(I)=YH(J+1,I)
  130 YL(I)=YG(J+1,I)
  140 CALL F (A,YM,DY)
      DO 180 I=1,N
      V=DT(I,1)
      DT(I,1)=0.5*(YM(I)+YL(I)+G*DY(I))
      C=DT(I,1)
      TA=C
      IF (L.LE.0) GO TO 170
      DO 160 K=1,L
      B1=D(K+1)*V
      B=B1-C
      U=V
      IF (B.EQ.0.0) GO TO 150
      B=(C-V)/B
      U=C*B
      C=B1*B
  150 V=DT(I,K+1)
      DT(I,K+1)=U
  160 TA=U+TA
  170 IF (DABS(Y(I)-TA).GT.E*DABS(S(I))) KONV=.FALSE.
  180 Y(I)=TA
      IF (KONV) GO TO 200
      D(3)=4.0
      D(5)=16.0
      BO=.NOT.BO
      M=R
      R=SR
  190 SR=2*M
      BH=.NOT.BH
      NEWH=.TRUE.
      H=0.5*H
      GO TO 50
  200 H=FC*H
      X=A
      RETURN
C
C
C
  210 FORMAT (1X,'ERROR IN DSDIFF: ORDER LESS THAN 1 OR GREATER THAN 25'
     1)
  220 FORMAT (51H0ERROR LIMIT TOO SMALL FOR DSDIFF.  WE USE 1.0D-13.)
      END

MISSION

of

Rome Air Development Center

RADC plans and executes research, development, test and
selected acquisition programs in support of Command, Control,
Communications and Intelligence (C3I) activities. Technical
and engineering support within areas of technical competence
is provided to ESD Program Offices (POs) and other ESD
elements. The principal technical mission areas are
communications, electromagnetic guidance and control,
surveillance of ground and aerospace objects, intelligence data
collection and handling, information system technology,
ionospheric propagation, solid state sciences, microwave
physics and electronic reliability, maintainability and
compatibility.


